Dependency and reproducibility

When code lets you down

Most statistical publications are updated on a scheduled basis when new data becomes available. Usually the code will run perfectly, but sometimes it will break. The code didn’t change, so what went wrong?

The data changed
Other people’s code changed

Data changes all the time, which is why your RAP project should check the data before it does anything else. Other people’s code also changes frequently. Unfortunately you can’t avoid this by only using your own code, because you would have to invent computers from scratch.

This article is about coping with changes to other people’s code, and writing code that other people can depend on.

Who is affected

Three groups of people will use your code:

The present day you
The future you
Other people at any time

The aims and starting points of each of these groups create different challenges. Different approaches can help you mitigate various problems that each group might face when using the scripts you’ve written.

Present day you wants to create your outputs today. Your starting point is the packages that you have installed on the tools you are using to produce your outputs.

Future you wants to re-use the code you’ve already written (some time ago) to update the outputs you’ve previously created. In between creating outputs you may have updated your version of the software you are using as well as the packages you are using to create those outputs. This can cause problems with backward dependencies if packages have not been consistent when creating their updates.

Other people at any time want to take your code that you’ve shared to recreate or update the outputs that you’ve previously created. Depending on how software and package management is performed in the organisation, this group may not have the same packages installed as you did when you originally shared your script.

Depend on code

When you use other people’s code, their code is called a “dependency”. It is impossible to avoid dependencies without reinventing computers from scratch. It also isn’t always obvious when your project has been affected by a change in a dependency.

You can mitigate the risks of having dependencies if you

notice problems with dependencies
choose a few reliable dependencies
know about changes in advance
control which dependencies are installed and their versions

These topics are explained in the following sections.

Notice problems with dependencies

Changes to dependencies might not impact your publication. If changes do have an impact, the best case is that you’ll get a helpful error message. At worst, your code will execute with an imperceptible but impactful error. Maybe a rounding function now rounds to the nearest 10 instead of the nearest 1000.

Here are some ways your code can be affected.

A dependency might have a new bug
A dependency might break an assumption that your code made about a previous version
There might be a newer and better way to do what the dependency does

Problems can be compounded if your dependencies depend on other dependencies, or if two of your dependencies require conflicting versions of a third dependency. If you have to change a dependency that has its own dependencies, then you could get stuck in so-called dependency hell.

Choose a few reliable dependencies

You are likely to make use of other people’s code when you develop your RAP project. Maybe you’ve imported packages to perform a statistical test, for example. These dependencies can be extremely helpful. They can:

prevent you from recreating code that already exists
save you time trying to solve a problem and optimise its solution
give you access to code and solutions from experts in the field
help to reduce the size of your scripts and make them more human-readable
limit the need for you to update and fix problems yourself

You can reduce the risk of problems by only depending on a few reliable packages. To put it succinctly, the tinyverse philosophy of dependency management suggests that:

Lightweight is the right weight

To achieve this you could:

minimise the number of dependencies and remove redundancy where possible
avoid depending on packages that in turn have many dependencies
restrict yourself to stable packages for which recent changes were restricted to minor updates and bug-fixes
review regularly your dependencies to establish if better alternatives exist
Record the packages and versions

It isn’t enough to merely minimise your dependencies. You need to think about how this impacts the reproducibility of your project. To ensure that your scripts are executed in the same way next time, you need to record the packages and their versions in some way. Then you or a colleague can recreate the environment in which the outputs were produced the first time round.

Know about changes in advance

Set aside time to learn about changes to dependencies and new developments relevant to your project.

Subscribe to alerts. This is easy to do if the dependency is hosted on GitHub or a package distribution network like CRAN or PyPi.
Read blog aggregators. Changes to packages are often written about in advance. By following a few blog aggregators you are likely to see relevant posts.
Follow developers on social media. Many developers discuss changes to their packages on social media, for example Twitter.

A side effect of these seemingly inefficient activities is that you will become better at your job. Consider contributing what you learn back to the community by posting in the Gov Data Science Slack, peer reviewing other people’s projects, and teaching people at coffee and coding.

Control which dependencies are installed and their versions

Maintainers signal updates by increasing the version number of their software. This could be a simple patch of an earlier version’s bug (e.g. version 3.2.7 replaces 3.2.6), or perhaps a major breaking change (e.g. version 3.2.6 is update to version 4.0.0).

Write down dependencies and their versions

There are many ways to record each of the packages used in your analysis and their version numbers.

In R you could, for example, use the session_info() function from the devtools package. This prints details about the current state of the working environment.

devtools::session_info()

You could do something like pkgs <- devtools::session_info()$packages to save a dataframe of the packages and versions.

You can achieve a similar thing for Python with pip freeze in a shell script.

> pip freeze
## alabaster==0.7.10
## anaconda-client==1.6.14
## anaconda-navigator==1.8.7
## anaconda-project==0.8.2
## appnope==0.1.0
## appscript==1.0.1
...

You can save this information with something like pip freeze > requirements.txt in the shell. The packages should be ‘pinned’ to specific versions, meaning that they’re in the form packageName==1.3.2 rather than packageName>=1.3.2. The point is to record specific versions, not specific versions or newer.

But simply saving this information in your project folder isn’t good dependency control. It:

would be tedious for analysts to read these reports and download each recorded package version one-by-one
records every package and its version on your whole system, not just the ones relevant to your project
isn’t a reproducible or automated process

The following sections explain how to use a list of dependencies more effectively.

Control which dependencies are installed and their versions

Ideally you should automate the process of recording packages and their version numbers and have them installed in an isolated environment that’s specific to your project. Doing this makes the project more portable – you could run it easily from another machine that’s configured differently to your own – and it would therefore be more reproducible.

Package managers in R

There is currently no consensus approach for package management in R. Below are a few options, but this is a non-exhaustive list.

The renv package is the newest package manager from RStudio. It is introduced briefly in the next section.

The packrat package is commonly used but has known problems. The RAP community has noted in particular that it has a problem compiling older package versions on Windows. Join the discussion for more information.

A packrat walkthrough is available, but the basic process is:

Activate ‘packrat mode’ in your project folder with init(), which records and snapshots the packages you’ve called in your scripts.
Install new packages as usual, except they’re now saved to a private package repository within the project, rather than your local machine.
By default, regular snapshots are taken to record the state of dependencies, but you can force one with snapshot().
When opening the project fresh on a new machine, Packrat automates the process of fetching the packages – with their recorded version numbers – and storing them in a private package library created on the collaborator’s machine.

As for other options, the checkpoint package from Microsoft’s Revolution Analytics works like packrat but you simply checkpoint() your project for a given date. This allows you to call the packages from that date into a private library for that project. It works by fetching the packages from the Microsoft R Application Network (MRAN), which is a daily snapshot of CRAN. Note that this doesn’t permit control of packages that are hosted anywhere other than CRAN, such as Bioconductor or GitHub, and relies on Microsoft continuing to snapshot and store CRAN copies in MRAN.

Another option is jetpack, which is different to packrat because it uses a DESCRIPTION file to list your dependencies. DESCRIPTION files are used in package development to store information, including that package’s dependencies. This is a lightweight option and can be run from the command line.

Paid options also exist, but are obviously less accessible and require maintenance. One example is RStudio’s Package Manager.

R package {renv}

In R, it is possible to manage your packages without a centralised organisational level package management systems. The renv package supports this.

The approach renv takes to package updates is:

Isolate a project
Record the current dependencies
Upgrade packages

This “snapshot and restore” approach can also be used to promote content to production. In fact, this approach is exactly how RStudio Connect and shinyapps.io deploy thousands of R applications to production each day.

Isolate the project from within the project directory by using the code

renv::init()

Record the current dependencies by using the following code. This step is important because if the updates go pear-shaped, you will have a point you can revert back to.

# record the current dependencies in a file called renv.lock
renv::snapshot()

# commit the lockfile alongside your code in version control
# and use this function to view the history of your lockfile
renv::history()

# if an upgrade goes astray, revert the lockfile
renv::revert(commit = "abc123")

# and restore the previous environment
renv::restore()

Upgrade packages in whichever way you normally do. One method is using the pak package:

pak::pkg_install("ggplot2")

Virtual environments in Python

In Python you can create an isolated environment for your project and load packages into it. This is possible with tools like virtualenv and pipenv.

You can set up a virtual environment in your project folder, activate it, install any packages you need and then record them in a file for use in future. One way to do this is with virtualenv. After installation and having navigated to your project’s home folder, you can follow something like this from the command line:

virtualenv venv  # create virtual environment folder
source venv/bin/activate  # activate the environment
pip install packageName  # install packages you need
pip freeze > requirements.txt  # save package-version list
deactivate  # deactivate the environment when done

When another user downloads your version-controlled project folder, the requirements.txt file will be there. Now they can create a virtual environment on their machine following steps 1 to 3 above, but rather than pip install packageName for each package they need, they can automate the process by installing everything from the requirements.txt file with:

pip install -r requirements.txt

This will download the packages one by one into the virtual environment in their copy of the project virtual environment. Now they’ll be using the same packages you were when you developed your project.

Containers

Good package management deals with one of the major problems of dependency hell. But the problem is bigger. Collaborators could still encounter errors if they:

try to run your code in a later version of the language you used during development
use a different or updated Integrated Development Environment (IDE, like RStudio or Jupyter Notebooks)
try to re-run the analysis on a different system, like if they try to run code on a Linux machine but the original was built on a Microsoft machine

What you really want to do is create a virtual computer inside your computer – a container – with everything you need to recreate the analysis under consistent conditions, regardless of who you are and what equipment you’re using.

Imagine one of those ubiquitous shipping containers. They are:

capable of holding different cargo
can provide an isolated environment from the outside world
can be transported by various transport methods

This is desirable for projects as well. You can put whatever you want inside, isolate it, and be able to run it from anywhere.

Docker

Docker is a container system. It works like this:

Create a ‘dockerfile’. This is like a plain-text recipe that will build from scratch everything you need to recreate a project. It’s just a textfile that you can put under version control.
Run the dockerfile to generate a Docker ‘image’. The image is an instance of the environment and everything you need to recreate your analysis. It’s a delicious cake you made following the recipe.
Other people can follow the dockerfile recipe to make their own copies of the delicious image cake. Each running instance of an image is called a container.

You can learn more about this process by following the curriculum on Docker’s website. You can also read about the use of Docker in the Department for Work and Pensions (DWP). Phil Chapman wrote more about the technical side of this process.

You don’t have to build everything from scratch. Docker hub is a big library of pre-prepared container images. For example, the rocker project on Docker hub lists a number of images containing R-specific tools like rocker/tidyverse that contains R, RStudio and the tidyverse packages. You can specify a rocker image in your dockerfile to make your life easier. Learn more about rOpenSci labs tutorial

As well as rocker, R users can set up Docker from within an interactive R session: the containerit package lets you create a dockerfile given the current state of your session. This simplifies the process a great deal. R users can also read Docker for the useR by Noam Ross and an introduction to Docker for R users by Colin Fay.