Vectorization in R

Vectorization in R#

For further information, refer to Section 24.5 of Improving Performance by Hadley Wickham.

Overview#

R is an interpreted language, meaning it is not compiled into machine code before execution like in C or C++. If you really want to dig into the difference, you could start with this thread. If you don’t know what this means, don’t panic. All we need to know is that loops, especially nested loops are typically slow in interpreted languages. That is, they should be avoided if possible in R.

Although vectorising your code is more than just avoiding loops, one aspect of vectorisation is avoiding loops. In particular, I mean avoiding loops of R code. One way of avoiding loops in R is through vectorized functions, which have their loops implemented in C under-the-hood. A vectorized function operates on every element of the vector at once, avoiding looping through the vector, in R, and operating element-by-element. By exploiting vectorization we can potentially write faster R code, and it is almost certain you have been doing this without even knowing!

Examples#

Suppose we want to calculate the cumulative sum of a vector $x$. We could write a function in R using a for loop as follows:

%%R
cumsum_R <- function(x) {
  len_x <- length(x)
  
  # best practice to initialise a vector of known length, rather than append
  out <- numeric(len_x)
  
  curr_total <- 0
  
  for (i in 1:len_x) {
    curr_total <- curr_total + x[i]
    out[i] <- curr_total
  }
  
  out
  
}

cumsum_R(1:10)

 [1]  1  3  6 10 15 21 28 36 45 55

In R there is a built in function called cumsum() that can do this for us. You can view the documentation by executing ?cumsum() in your session, and you will see there are additional members of this family e.g. cumprod(). By executing cumsum without brackets in your session you can see that the cumsum() function is calling .Primitive; this means that R is calling a function which is implemented entirely in C, so it is going to be fast.

%%R
cumsum

function (x)  .Primitive("cumsum")

Let’s compare our own naive implementation against the cumsum() function provided in the base R package.

%%R
test_x <- 1:1000
results <- microbenchmark(cumsum(test_x), cumsum_R(test_x))
summary(results)

              expr    min     lq     mean  median      uq    max neval
1   cumsum(test_x)  1.804  1.968  2.13036  2.0295  2.2550  2.952   100
2 cumsum_R(test_x) 29.356 29.766 31.12556 30.0120 30.4835 41.574   100

In addition: Warning message:
In microbenchmark(cumsum(test_x), cumsum_R(test_x)) :
  less accurate nanosecond times to avoid potential integer overflows

As you can see, our naive implementation is extremely slow in comparison!

Exercise#

Try to write your own version of the rowMeans function in R, using for loops, and compare it with the vectorised version in R. To test your function you can use the following test_matrix matrix below.

%%R
test_matrix <- matrix(1:100,10,10)
rowMeans(test_matrix)

# Write your code here

# Benchmark here

 [1] 46 47 48 49 50 51 52 53 54 55

Vectorization in R

Contents

Vectorization in R#

Overview#

Examples#

Exercise#