Reproducibility in R

The statistical programming language R can make large tasks more manageable and semi-automated, and it lets you create reusable code for repeated tasks. You can use R to process data or to build statistical and machine learning models that help predict outcomes and measure the impact of certain actions on your business goals. One great thing about R is that it lets you write reproducible code, so that others can replicate your analyses without headaches. Here are a few tips for doing this.

Always Include Your Data with Your R Code

Whether this means sending the full folder structure along with your script and setting up your working directory accordingly, or storing your data on the web and accessing it with one of the many R packages available (for example, keeping it on Google Drive and reading it with the R package googlesheets), your data should be readily available to whoever is going to run your R code. This is a no-brainer: without the data, your coworkers cannot run your code and confirm that it works.

Setting up your working directory consistently, or using an R package to connect to an external data source, means that no one has to edit the setwd() statements in your code; it will simply run on anyone’s computer. This is a highly desirable trait of reproducible code.
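One way to avoid hard-coded setwd() calls is to build file paths relative to the project folder. The sketch below is only an illustration: the helper name read_project_data, the data/ subfolder, and the sales.csv file are all assumptions made for the example, not a required layout. It uses base R only.

```r
# Hypothetical helper: read a CSV stored under <project_root>/data/.
# Paths are built with file.path() so they work on any operating system,
# and nothing depends on an absolute path from one person's machine.
read_project_data <- function(file_name, project_root = ".") {
  data_path <- file.path(project_root, "data", file_name)

  # Fail early with a clear message if the data did not travel with the code.
  if (!file.exists(data_path)) {
    stop("Expected data file not found: ", data_path)
  }

  read.csv(data_path, stringsAsFactors = FALSE)
}

# Demo: create a throwaway project in a temporary folder so the example
# runs anywhere. In real use the data/ folder would ship with the script.
demo_root <- file.path(tempdir(), "my_project")
dir.create(file.path(demo_root, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(
  data.frame(month = c("Jan", "Feb"), revenue = c(150, 75)),
  file.path(demo_root, "data", "sales.csv"),
  row.names = FALSE
)

sales <- read_project_data("sales.csv", project_root = demo_root)
print(sales)
```

With the default project_root = ".", the same call works for anyone who opens the project folder and runs the script from there, with no path edits.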

Comment and Document Your Code

Whenever you share your R code, it is a good idea to comment it. Comments let anyone reading the code know how it works and why you wrote it the way you did.

I also like to provide documentation on how to run any program or process. That way, when I hand off a program or process, whoever takes it on has everything they need to do the task: both the program and the instructions.
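As a rough illustration of what this can look like, here is a small commented script with a short "how to run" header. The function name summarise_revenue and the column names month and revenue are made up for the example.

```r
# Purpose:    total revenue by month for the monthly report.
# Input:      a data frame with columns `month` and `revenue`
#             (hypothetical names chosen for this example).
# Output:     a data frame with one row per month and its total revenue.
# How to run: source this file, then call summarise_revenue(your_data).

summarise_revenue <- function(sales) {
  # Check inputs up front so a colleague gets a clear error
  # instead of a confusing failure further down.
  stopifnot(
    is.data.frame(sales),
    all(c("month", "revenue") %in% names(sales))
  )

  # Base R's aggregate() is used on purpose: the script then needs
  # no extra packages installed to run.
  aggregate(revenue ~ month, data = sales, FUN = sum)
}

# Small example input so a reader can try the function immediately.
example_sales <- data.frame(
  month   = c("Jan", "Jan", "Feb"),
  revenue = c(100, 50, 75)
)

monthly_totals <- summarise_revenue(example_sales)
print(monthly_totals)
```

The comments here explain why a choice was made (for instance, sticking to base R), which is usually more useful to the next person than restating what each line does.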

Using knitr/R Markdown to Document Your Process

You can level up your documentation by using knitr or R Markdown to create a notebook-style document. This brings your documentation and process together with chunks of your R code in a single HTML file that is both easy to use and fully reproducible.

Putting all your code and your “how to” documentation in one place makes your work easier for others to reproduce.
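To make this concrete, the sketch below writes a minimal R Markdown file from R and renders it to HTML when the tooling is present. The report title, chunk label, and file name are invented for the example, and the render step is skipped unless the rmarkdown package and pandoc are installed. Normally you would write the .Rmd file by hand in an editor; it is generated here only so the example is self-contained.

```r
# Build the three-backtick chunk delimiter programmatically so this example
# can itself be displayed inside a fenced code block.
fence <- strrep("`", 3)

# Contents of a minimal R Markdown document: a YAML header, one line of
# prose, and one R code chunk (all illustrative).
rmd_lines <- c(
  "---",
  "title: \"Monthly revenue summary\"",
  "output: html_document",
  "---",
  "",
  "This report totals revenue by month.",
  "",
  paste0(fence, "{r revenue-summary}"),
  "sales <- data.frame(month = c(\"Jan\", \"Jan\", \"Feb\"),",
  "                    revenue = c(100, 50, 75))",
  "aggregate(revenue ~ month, data = sales, FUN = sum)",
  fence
)

rmd_path <- file.path(tempdir(), "revenue_report.Rmd")
writeLines(rmd_lines, rmd_path)

# Render to HTML only if rmarkdown and pandoc are available on this machine.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path, output_dir = tempdir(), quiet = TRUE)
}
```

The resulting HTML file shows the prose, the code, and the code's output together, so a reader can see exactly what was run to produce each result.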

The main idea behind reproducible code is to create something that anyone can run in the future, without errors, and get the same results you did. This ensures that everyone knows what the “true” results are, because your code gives the same result on everyone’s computer. It also makes sharing your R code easier. By setting up your R code and data properly, documenting your work, and using R Markdown or knitr, you can produce reproducible code that anyone at your company who uses R will be able to run with ease.

Reproducibility matters regardless of the tool or programming language you are using, so keeping these principles in mind and applying them to the tool at hand can make everyone’s life easier.