Reproducibility: Using Fixed CRAN Repository Snapshots

Latest Isn't Always Greatest

CRAN has always provided access to the latest versions of the vast majority of R packages. With the growing popularity of R, the sea of packages hosted on CRAN and its many mirrors is in a continuous state of flux, with packages being added, updated, and archived all the time. To some this might seem like an advantage. To many, however, these ever-changing packages present significant challenges.

For example, a package you used yesterday may have been updated overnight, or maybe one of its dependencies was, and now your script no longer works as expected. Developers are left wondering, "When do they plan to fix and update this package? Do I need to rewrite my script?" Packages get fixed whenever their maintainers choose to do so, whether that's today, tomorrow, or next month. Each time a package breaks, so will all of the scripts using that version of the package. This approach is clearly suboptimal with respect to the stability that R programmers crave.

Similarly, whenever users point to the latest CRAN repository, install.packages could install one version of a package for 'User_A' today, install another version of that same package for 'User_B' who points to a different mirror, or even return a "package not found" error when 'User_C' attempts to install tomorrow. Once again, this inconsistency presents challenges when sharing scripts.

Daily CRAN Repository Snapshots

To address this, Microsoft offers a downstream distribution of CRAN R packages. Since September 17th, 2014, the checkpoint server has been taking a daily snapshot at precisely midnight UTC of the entire CRAN repository and storing it on MRAN. These snapshots have been available to the R community ever since.

Try out the CRAN Time Machine. Just pick a date and explore the contents of CRAN as they were on that date.

Each version of Microsoft R Open is preconfigured to point to a specific snapshot. In addition, any R user can install a package version from any existing snapshot using the checkpoint package.

Note: Non-CRAN packages, such as those available on GitHub, are not part of the snapshot process.

Reproducibility with Microsoft R Open

Microsoft R Open offers you reproducibility out of the box. During the installation of Microsoft R Open, the CRAN repository is configured to point to a specific CRAN repository snapshot. For Microsoft R Open 3.3.2, that fixed CRAN repository snapshot is the snapshot taken on November 1, 2016. As a result, with Microsoft R Open 3.3.2, install.packages will by default retrieve packages as they were at midnight UTC on November 1, 2016.
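
As a rough illustration of what such a preconfiguration amounts to, the sketch below points an ordinary R session at a dated snapshot by setting the repos option. The helper name snapshot_url and the URL pattern https://mran.microsoft.com/snapshot/<date> are assumptions that are not stated on this page, so check the address your own installation uses before relying on it.

```r
# Hypothetical helper: build the repository URL for a dated snapshot.
# The base URL is an assumption; it is not given in the text above.
snapshot_url <- function(date, base = "https://mran.microsoft.com/snapshot") {
  # as.Date() returns NA when the string does not match the format
  d <- as.Date(date, format = "%Y-%m-%d")
  if (is.na(d)) stop("date must be in YYYY-MM-DD form: ", date)
  paste0(base, "/", format(d, "%Y-%m-%d"))
}

# Point this session's CRAN repository at the November 1, 2016 snapshot.
options(repos = c(CRAN = snapshot_url("2016-11-01")))

# install.packages() now resolves packages against this URL by default.
getOption("repos")[["CRAN"]]
```

Setting the option inside a session only lasts until R exits, which is one reason a fixed default configured at installation time is convenient.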

The advantages of using a fixed CRAN repository snapshot are two-fold:

  1. Reproducibility across time. The CRAN package versions you use won't change unless you change them. That means that not only will the package version you used when you wrote your script remain the same, but so will all of the dependencies of that package, and even the dependencies of those dependencies. Non-CRAN packages are not part of this snapshot, so their latest versions will be used.
  2. Reproducibility across Microsoft R Open users. Using a fixed CRAN repository snapshot means that every user of Microsoft R Open has access to the same set of CRAN package versions. Not only will you always retrieve the same package versions, but so will all of the other Microsoft R Open 3.3.2 users. This is why your script that works for you today will also work for all other Microsoft R Open users who are using the same fixed CRAN repository snapshot as you tomorrow, next week, and next month.

This makes the sharing of R code that relies on R packages easier, and reduces the chance of incompatible R packages being installed on the same system.

While having access to the latest CRAN packages is useful at times, stability just isn't possible if you are pointing to the ever-moving, ever-evolving set of packages on CRAN. No need to worry about missing out either. Each new release of Microsoft R Open will point to a more recent fixed CRAN repository snapshot. Read about standardized release dates...

If you find that you need a few packages from another date, use the checkpoint package. If you need to point Microsoft R Open to another repository entirely, follow the instructions provided here.

If you are interested in reproducible research, we highly recommend that you disable the workspace save and reload feature. That way, you can be sure you are working in a clean environment, so anyone with the same package versions can reproduce your results. Learn more...

Package Time Machine (checkpoint)

Need a package from another time? No problem. The checkpoint package, which is installed by default during the installation of Microsoft R Open, is yet another enhancement to R. This package is designed to make it easy to write reproducible R code by allowing you to go backward (or forward) in time to retrieve the exact versions of the packages you need.

The checkpoint package offers yet another way to promote reproducibility when working with others. Using checkpoint in your scripts allows you and anyone else with that script to access the same specified CRAN repository snapshot, and consequently, install and use the same package (and dependency) versions needed.

With checkpoint, you can access any existing snapshot, not just the one preconfigured for your version of Microsoft R Open. In fact, you could even use the checkpoint() function once to retrieve packages from one date, and then use it again to retrieve packages from a different date all in the same script.

To make use of this feature, add the following lines to your R code and take a ride in the checkpoint time machine:

library(checkpoint)
checkpoint("YYYY-MM-DD")

Where YYYY-MM-DD is the date of the CRAN repository snapshot you want to use, such as 2016-11-01. Explore this package and its documentation...

For the argument to the checkpoint() function, choose any date in the past (say, yesterday's date), and checkpoint will install all packages required by your project as they were at midnight UTC on the specified date. When you use the checkpoint package, those packages are installed, in the versions that existed on the specified date, in a subfolder for your project underneath ~/.checkpoint. This means that packages installed for one project are independent of packages installed for all other projects, unless they use the same checkpoint date.
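
The limits on the date argument (a date in the past, and no earlier than the first snapshot on September 17, 2014) can be checked before calling checkpoint(). The sketch below is illustrative only: check_snapshot_date is a hypothetical helper, not part of the checkpoint package, and the checkpoint() call is left commented out because it needs the package and access to the snapshot server.

```r
# Hypothetical helper (not part of checkpoint): verify that a date string
# falls inside the range of snapshots described in the text.
check_snapshot_date <- function(date, today = Sys.Date()) {
  first_snapshot <- as.Date("2014-09-17")  # first daily snapshot per the text
  d <- as.Date(date, format = "%Y-%m-%d")  # NA if the string does not parse
  if (is.na(d)) stop("use YYYY-MM-DD, e.g. 2016-11-01")
  if (d < first_snapshot) stop("no snapshots exist before 2014-09-17")
  if (d >= today) stop("choose a date in the past")
  format(d, "%Y-%m-%d")
}

snapshot <- check_snapshot_date("2016-11-01")

# With the package installed and the snapshot server reachable:
# library(checkpoint)
# checkpoint(snapshot)
```

Failing early on a malformed or out-of-range date gives a clearer message than waiting for the package installation step to fail.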

The checkpoint package is installed by default with Microsoft R Open. Get the latest development version, submit bug reports, and make feature requests on the checkpoint GitHub pages.

Standardized Release Dates

Our hope is to offer further stability gains to all R users in the future by publishing the planned release dates and encouraging the R community to develop and release a stable set of packages together. With these release dates in mind, R package developers can test their packages with others, and time their release to coincide with those dates. And, if enough people do this, we'll have stability in the R ecosystem: reproducible script results using stable packages.