%load_ext rpy2.ipython
/Users/cc1333/Library/Python/3.9/lib/python/site-packages/rpy2/ipython/rmagic.py:77: UserWarning: The Python package `pandas` is strongly recommended when using `rpy2.ipython`. Unfortunately it could not be loaded (error: No module named 'pandas'), but at least we found `numpy`.
warnings.warn('The Python package `pandas` is strongly '
%%R
suppressPackageStartupMessages({
suppressMessages({
library(microbenchmark)
})
})
Vectorization in R#
For further information, refer to Section 24.5 of Improving Performance by Hadley Wickham.
Overview#
R is an interpreted language, meaning it is not compiled into machine code before execution like in C or C++. If you really want to dig into the difference, you could start with this thread. If you don’t know what this means, don’t panic. All we need to know is that loops, especially nested loops are typically slow in interpreted languages. That is, they should be avoided if possible in R.
Although vectorising your code is more than just avoiding loops, one aspect of vectorisation is avoiding loops. In particular, I mean avoiding loops of R code. One way of avoiding loops in R is through vectorized functions, which have their loops implemented in C under-the-hood. A vectorized function operates on every element of the vector at once, avoiding looping through the vector, in R, and operating element-by-element. By exploiting vectorization we can potentially write faster R code, and it is almost certain you have been doing this without even knowing!
Examples#
Suppose we want to calculate the cumulative sum of a vector \(x\). We could write a function in R
using a for loop as follows:
%%R
cumsum_R <- function(x) {
len_x <- length(x)
# best pracitce to initialise a vector of known length, rather than append
out <- numeric(len_x)
curr_total <- 0
for (i in 1:len_x) {
curr_total <- curr_total + x[i]
out[i] <- curr_total
}
out
}
cumsum_R(1:10)
[1] 1 3 6 10 15 21 28 36 45 55
In R there is a built in function called cumsum()
that can do this for us. You can view the documentation by executing ?cumsum()
in your session, and you will see there are additional members of this family e.g. cumprod()
. By executing cumsum
without brackets in your session you can see that the cumsum()
function is calling .Primitive
; this means that R is calling a function which is implemented entirely in C, so it is going to be fast.
%%R
cumsum
function (x) .Primitive("cumsum")
Let’s compare our own naive implementation against the cumsum()
function provided in the base
R package.
%%R
test_x <- 1:1000
results <- microbenchmark(cumsum(test_x), cumsum_R(test_x))
summary(results)
expr min lq mean median uq max neval
1 cumsum(test_x) 1.804 1.968 2.13036 2.0295 2.2550 2.952 100
2 cumsum_R(test_x) 29.356 29.766 31.12556 30.0120 30.4835 41.574 100
In addition: Warning message:
In microbenchmark(cumsum(test_x), cumsum_R(test_x)) :
less accurate nanosecond times to avoid potential integer overflows
As you can see, our naive implementation is extremely slow in comparison!
Exercise#
Try to write your own version of the rowMeans
function in R, using for loops, and compare it with the vectorised version in R. To test your function you can use the following test_matrix
matrix below.
%%R
test_matrix <- matrix(1:100,10,10)
rowMeans(test_matrix)
# Write your code here
# Benchmark here
[1] 46 47 48 49 50 51 52 53 54 55