knitr: Elegant, flexible and fast dynamic report generation with R

by Yihui Xie

The knitr package is an alternative to Sweave with a more flexible design and many more features. CRAN page: http://cran.r-project.org/package=knitr ; development repository: https://github.com/yihui/knitr ; website (documentation and demos): http://yihui.name/knitr/ ; manual: https://github.com/downloads/yihui/knitr/knitr-manual.pdf (use Adobe Reader to see something surprising there)

For those who are not familiar with dynamic report generation or literate programming with R, the idea is that R code can be mixed into a document, and when the document is compiled, the code is evaluated and its results (numeric output, graphics, etc.) are written into the output document (a minimal introduction). Sweave implemented the basic idea, but there is still a lot of room for improvement for production use. For example, it is unrealistic to restrict each code chunk to a single plot, and knitr handles this more naturally. [Screenshot: an example taken from the knitr manual]

Elegance

The elegance of knitr comes from several aspects: code reformatting (with the formatR package, to make R code better formatted), highlighting (with the highlight package, to make R code more readable), support for tikz graphics (with the tikzDevice package, to produce high-quality R graphics; see the graphics manual for examples) and careful attention to detail. For example, by default there are no prompt characters like > and + in the R code output, so it is easy for the reader to copy and run the code; the results returned by R are commented out with ## so the reader can see the results without mangling the R source code (i.e. the output is still valid R code); and the default number of digits is set to be small so we do not get too many digits.
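The defaults described above correspond to chunk options that can also be set explicitly. The following is a minimal sketch, assuming a current knitr release in which the option names `prompt`, `comment`, `tidy` and `highlight` are available through `opts_chunk`; the values are illustrative rather than a statement of what any particular knitr version uses by default.

```r
library(knitr)

# Set global chunk options for all subsequent code chunks.
opts_chunk$set(
  prompt    = FALSE,  # no > or + prompts in front of the source code
  comment   = "##",   # prefix each line of output so it stays valid R code
  tidy      = FALSE,  # TRUE asks knitr to reformat code via formatR (must be installed)
  highlight = TRUE    # syntax-highlight the source code
)

# Print fewer digits in numeric output (a base R option; 4 is an illustrative value).
options(digits = 4)

# Inspect the values now in effect.
str(opts_chunk$get(c("prompt", "comment", "tidy", "highlight")))
```

The same options can be given per chunk in the chunk header when only one chunk should deviate from the global settings.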

In short, knitr was designed with one belief in mind: beauty should come by default.

Flexibility

Unlike Sweave, which was mainly targeted at LaTeX, knitr was designed without any restrictions on the input or the output format. It can be quickly adapted to HTML or other types of output, since the core components (code extraction and evaluation) are not hard-coded. This is a simple example showing how knitr works with the markdown format:

https://github.com/yihui/knitr/blob/master/inst/examples/knitr-minimal.md
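As a rough illustration of the same workflow without leaving the R console, the sketch below compiles a tiny markdown document held in a character vector. It assumes a current knitr release and its fenced markdown chunk syntax, which may differ from the syntax used in the linked example; the document content is made up for the demonstration.

```r
library(knitr)

# A tiny markdown document with one R code chunk (hypothetical content).
doc <- c(
  "The sum of one and one:",
  "",
  "```{r}",
  "1 + 1",
  "```"
)

# When the input is supplied through the `text` argument, knit() evaluates
# the chunk and returns the compiled markdown as a character string.
out <- knit(text = doc, quiet = TRUE)
cat(out)
```

The compiled output contains both the source code and the evaluated result, with the result lines prefixed by the comment characters.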

There is a whole set of hooks which can be used to customize the output (http://yihui.name/knitr/hooks). For example, you can easily use the listings package to decorate your R code if you do not like the default style.
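A minimal sketch of such a customization, assuming a LaTeX (Rnw) document whose preamble already loads the listings package: the `source` output hook is replaced by a function that wraps the chunk's source code in an `lstlisting` environment. The wrapper shown here is a hand-written illustration; knitr also provides `render_listings()` for this purpose.

```r
library(knitr)

# Replace the output hook for R source code. `x` is the source text of a
# chunk and `options` is the list of chunk options (unused in this sketch).
knit_hooks$set(source = function(x, options) {
  paste0(
    "\\begin{lstlisting}\n",
    paste(x, collapse = "\n"),
    "\n\\end{lstlisting}\n"
  )
})

# Call the hook by hand to see what it would write into the output document.
cat(knit_hooks$get("source")("x <- rnorm(10)", list()))
```

Hooks for other kinds of output (printed results, warnings, plots and so on) can be replaced in the same way; see the hooks page linked above.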

There are more than 20 built-in graphical devices, including PDF, PNG, tikz and many devices in the Cairo or cairoDevice package, and it requires little effort to switch between different devices.

The knitr package puts a lot of emphasis on graphics, and there are brand-new features like direct support for animations in LaTeX documents, as well as quick support for rgl 3D plots; see the PDF manual for examples (use Adobe Reader to view animations).

Speed

Learning from cacheSweave and pgfSweave, knitr also supports caching, through a different implementation. The idea of caching is that a code chunk can be skipped if its results have been cached before and the code has not changed since then, which makes code evaluation much faster. All the objects created in a cached chunk are lazy-loaded, meaning they are not actually loaded into the current R session until they are used in a later chunk; this saves some computing time as well. Note that the complete output of a code chunk is cached, which means the printed results as well as the graphics will show up as if they had just been created by the code chunk (in cacheSweave, we lose printed results and graphics).
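In knitr itself, caching is turned on with the chunk option `cache=TRUE`. The idea behind it can be sketched in a few lines of base R. This is only a conceptual illustration, not knitr's implementation: the `run_cached()` helper, the in-memory `cache` environment and the `work` environment are all hypothetical names, and `delayedAssign()` merely stands in for the on-demand loading described above.

```r
# A hypothetical in-memory cache with one entry per chunk label.
cache <- new.env()

# Evaluate `code` (a character string) in `envir`, unless the same code was
# already run under this label; in that case, reuse the stored value.
run_cached <- function(label, code, envir = globalenv()) {
  entry <- cache[[label]]
  if (!is.null(entry) && identical(entry$code, code)) {
    message("chunk '", label, "': unchanged, using cached result")
    return(entry$value)
  }
  message("chunk '", label, "': evaluating")
  value <- eval(parse(text = code), envir = envir)
  cache[[label]] <- list(code = code, value = value)
  value
}

# Demo: the counter shows that the second call does not re-run the code.
work <- new.env()
work$runs <- 0
chunk <- "runs <- runs + 1; sum(1:10)"
run_cached("demo", chunk, envir = work)  # evaluated
run_cached("demo", chunk, envir = work)  # skipped, same code as before
work$runs                                # still 1

# Lazy loading in the same spirit: the expression is only evaluated
# the first time `big` is actually used.
delayedAssign("big", { message("loading big"); 1:10 }, assign.env = work)
length(work$big)
```

Changing the code string under the same label causes the chunk to be evaluated again, which mirrors the rule that a cached chunk is only reused while its code is unchanged.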

Here is a real-world application to a time-consuming computing job: analysis of the NRC rankings data via the Bayesian Lasso (Rnw source; PDF output). As we know, MCMC often involves a large number of iterations, and this demo clearly shows the advantage of caching.

Summary

Although knitr is still a new baby, I can envision many potential applications in business for the following reasons:

  • Dynamic and automatic report generation saves time and human effort; as long as the R code has been set up correctly, knitr can take over the rest of the job; it takes the same amount of effort to compile a report once or ten thousand times;
  • Big data will definitely need the cache since it is often time-consuming to deal with, and we may not want to redo all the computing over and over again when generating reports automatically;
  • We need professional presentations of data and statistical models in business, and knitr tries to give beautiful output by default; we also need novel presentations, such as animations and sophisticated rgl 3D plots (see how boring and clumsy statistical reports usually are);
  • The report does not have to be restricted to a specific format such as LaTeX, and knitr is fully customizable to accommodate different types of demand; as a trivial example, knitr can be used as the backend of http://www.inside-r.org/pretty-r/, or as an online data processing tool (think http://opencpu.org/).

Last but not least, for LyX users, I have also added knitr support to LyX, which makes it really easy to use this package without taking care of the details of LaTeX. A short video is here: http://vimeo.com/32948939

Posted On Oct 25, 2011. Originally posted on inside-R.org.