Acceleration via Rcpp (advanced)

%load_ext rpy2.ipython
/Users/cc1333/Library/Python/3.9/lib/python/site-packages/rpy2/ipython/rmagic.py:77: UserWarning: The Python package `pandas` is strongly recommended when using `rpy2.ipython`. Unfortunately it could not be loaded (error: No module named 'pandas'), but at least we found `numpy`.
  warnings.warn('The Python package `pandas` is strongly '
%%R
suppressPackageStartupMessages({
  suppressMessages({
    library(Rcpp)
    library(microbenchmark)
  })
})

Acceleration via Rcpp (advanced)#

Overview#

C++ is a high-performance, general-purpose programming language that extends C with object-oriented, generic, and functional programming features. If you don’t know what those are, or have never used C++ before, do not worry. For us the most important point to note is that it can provide a fast alternative to R and can be integrated into our R code. In fact, under the hood, many of the base R functions are themselves written in C. In addition, many R packages on CRAN contain functions written in C++.

The easiest way to use C++ code in an R environment is through the Rcpp package. On MacOS and Linux systems, the Rcpp package is likely to be setup and loaded correctly after executing install.packages('Rcpp'), followed by library(Rcpp). On a windows OS, you may need to follow the instructions here to download Xtools.

You can check Rcpp is set up correctly by trying the following:

%%R
evalCpp("2+2")
[1] 4

Example: Estimating Pi#

Suppose we want to estimate the value of pi. One way of doing this is to use Monte Carlo sampling.

Consider a quarter-circle of radius 1 inside the unit square \( [0,1] \times [0,1].\) The equation for the quarter-circle is

\[x^2 + y^2 \leq 1.\]

Since the area of the full circle is \( \pi \), the quarter-circle has an area of \( \pi/4 \). By randomly sampling points inside the square and checking how many fall inside the quarter-circle, we can estimate \( \pi \) using:

\[ \pi \approx 4 \times \frac{M}{N} \]

where \( M \) is the number of points inside the quarter-circle, and \( N \) is the total number of points sampled.

We can plot this set-up:

%%R
# Define the quarter-circle function
theta <- seq(0, pi/2, length.out = 100)
x <- cos(theta)
y <- sin(theta)

samps <- runif(1000) |>
  matrix(ncol = 2)

inside <- samps[sqrt(rowSums(samps^2)) < 1,]
outside <- samps[sqrt(rowSums(samps^2)) >= 1,]

# Plot the quarter-circle
plot(x, y, type = "l", lwd = 2, col = 1, xlim = c(0,1), ylim = c(0,1),
     xlab = "x", ylab = "y", main = "Estimating Pi",
     xpd = FALSE, bty='l')
points(x = inside[,1], y = inside[,2], pch = 20, col = 1)
points(x = outside[,1], y = outside[,2], pch = 20, col = 2)
../../_images/85c2f2128db27c288effda72f2fe1a8c3588916c28ced5f52c7da7eb63e01c55.png

An implementation of this procedure in R is as follows:

%%R
pi_R <- function(n) {
  
  # we want n points, each point is in 2D space, so we need to sample m = 2*n
  # points using runif()
  m <-  2*n
  
  # get m samples - each sample is a row of the matrix 
  samples <- runif(m) |>
    matrix(ncol = 2)
  
  # how many lie inside the quarter circle?
  # get the distance between each row and the origin (0,0)
  n_inside_circle <- sum(sqrt(rowSums(samples^2)) < 1)
  
  # estimate pi by multiplying the proportion of points inside the quarter
  # circle by 4
  pi_estimate <- (n_inside_circle/n) * 4
}

Now, we can run the function for increasing \(n\), which should converge to the true value of \(\pi\).

%%R
# Estimates for different n
pi_estimate_1 <- pi_R(1)
pi_estimate_10 <- pi_R(10)
pi_estimate_100 <- pi_R(100)
pi_estimate_1000 <- pi_R(1000)
pi_estimate_10000 <- pi_R(10000)

pi_vec <- c(pi_estimate_1,
            pi_estimate_10,
            pi_estimate_100,
            pi_estimate_1000,
            pi_estimate_10000)

print(pi_vec)
[1] 4.0000 3.2000 3.1200 3.1640 3.1464

We can also plot the estimates, and compare them to the value of \(\pi\) contained in R, which is accessible through pi and is denoted by the dashed horizontal line.

%%R
plot(c(1,10,100,1000, 10000), 
     pi_vec, 
     type ='b', 
     pch = 20,
     xlab = "Number of Samples",
     ylab = "Estimate")
abline(h = pi, col = 4, lty = 2)
../../_images/17ba3347311893431d09cdadca1c3a38321c0c5bce23918a24929dc9b7b6c3db.png

Suppose we want to do the same in C++. We can write the C++ code in a seperate file with a .cpp, then call it within R using the sourceCpp() function provided in the Rcpp package. For completeness, the C++ source code contained in the file pi.cpp is as follows:

Loading the function into R is as simple as calling sourceCpp():

%%R
sourceCpp("pi.cpp")

We can now run the function:

%%R
pi_cpp(10000)
[1] 3.1368

Finally, let’s compare it with the R implementation using microbencmark:

%%R
results <- microbenchmark(pi_R(1e6), pi_cpp(1e6))
summary(results)
           expr       min        lq      mean    median       uq      max neval
1   pi_R(1e+06) 13.200319 13.955457 14.982102 14.191412 14.81519 26.64340   100
2 pi_cpp(1e+06)  8.339154  8.391716  8.683018  8.419248  8.49766 16.95912   100
In addition: Warning message:
In microbenchmark(pi_R(1e+06), pi_cpp(1e+06)) :
  less accurate nanosecond times to avoid potential integer overflows

The C++ version is about twice as fast as the R version using 1e6 samples.