class: left, middle, inverse, title-slide .title[ # R Markdown for reproducible reports ] .author[ ### David Benkeser, PhD MPH
Emory University
Department of Biostatistics and Bioinformatics
] .date[ ### INFO550
.white[bit.ly/info550]
.white[Additional reading]
] --- <style type="text/css"> .remark-slide-content { font-size: 22px } </style> ### Analysis reports As a <del>statistician</del> __data scientist__, you must effectively convey results. * Figures, tables, text describing results, etc... __Solutions__: * email with attachments * LaTex pdf * __R Markdown__ Why __R Markdown__? * Human readable source * Highly reproducible reports * Output html instead of pdf ??? If your job involves a heavy amount of data analysis, you will end up creating hundreds if not thousands of reports in your career. Invest the time now to come up with an efficient system for doing this! When I was a grad student, there were limited tools for doing this well. sweave was around, but the learning curve was steep and was not too much easier than just using LaTex. The advent of `knitr` and `rmarkdown` was revolutionary for making reproducible research more accessible. One of the big downsides of LaTex is how ugly the source code is. Want to make text bold face `\textbf{something}`. Ditto html. It's clunky. Markdown has an elegant and simple syntax that is appealing to read. Reports made with `knitr` can, in theory, be reproduced by anyone, anywhere. * caveat is you need: same R version, packages, not to use absolute paths, etc.. Recently, I have started using html output for reports over pdf. Appealing aspects of html include: * it can be viewed in a web browser (on a mobile device) * no worrying about page breaks and fitting figures on a page * heavily customizable, if you know a little css --- ### R Markdown R Markdown is developed by [RStudio](https://rstudio.com/). * __R__ : it can incorporate `R` code chunks * but also `python`, `bash`, ... * __Markdown__ : based on markdown syntax * but also LaTex, html, ... No more .red[copy and paste] for results. * All automated! * Don't touch that mouse! ??? I have my own thoughts about the RStudio GUI, but the best thing the RStudio team has done is introduce R Markdown. In what follows, code chunks in blue background + highlighting are the code that might appear in your R Markdown file, i.e., a file with `.Rmd` extension that includes your code for building your output document. Objects inside gray boxes are examples of what the code will look like when rendered. Note that if you are viewing the plain text `.Rmd` that I use to make these slides, the code will look a little more complex, as I have to find ways of making R Markdown display verbatim code instead of evaluating that code. --- ### Markdown ``` # This is a section header ## This is a subsection header ### This is a subsubsection header __This is bold text__. *This is italic text*. This is `monospace text for inline code`. ~~This text is struckthrough.~~ * A bulleted list * Another item in my list 1. A numbered list 2. The second item in my list --- text between horizontal rules --- [Hyperlinks are easy](https://google.com) ``` ??? A key design element of Markdown is its readability. The plain text files are imminently readable with various extra commands and markup associated with LaTex or html. Many websites allow text input in Markdown (e.g., GitHub comments, Stack Overflow, reddit, jekyll blogs, wordpress websites, ...). --- ### R Markdown R Markdown blends Markdown with code chunks. ````markdown Here is some plain text in my document explaining what I did. ```{r} # now I want to show some code my_variable <- rnorm(1) ``` ```` This renders as: .clearbox[ Here is some plain text in my document explaining what I did. .code-box[ \# now I want to show some code my_variable <- rnorm(1) ] ] ??? Use three back ticks + a language (in this case, R) to demarcate a "code chunk". When the document is compiled, code in these code chunks will be run in a fresh R session, while (before, actually) the document compiles. --- ### R Markdown You can also reference variables inline. ```markdown The value of the variable I created is `r my_variable`. ``` This renders as .clearbox[ The value of the variable I created is 0.3455167. ] No more copy and pasting results into text! ??? All inline code needs to be on a single line. * Helpful to use a hidden code chunk that defines variables and then reference variables by their name in the text. --- ### yaml header R Markdown documents begin with a __yaml header__. * "meta-data" for the document * options that control __how the document is rendered__ A basic __yaml header__: ````markdown --- title: "My Great Report" author: "Brilliant Student" date: "`r format(Sys.Date(), '%Y-%m-%d')`" output: html_document --- ```` ??? The yaml header includes all the information that `knitr` needs to know how to compile your document. Note that in the above, I have included inline R code to be rendered when the document compiles. * `format(Sys.Date(), '%Y-%m-%d')` formats the date in year-month-day format. --- ### yaml header You can include options in the yaml header. * note the nested structure * don't forget your `:`'s! * two spaces, __no tabs__! ````markdown --- title: "My Great Report" author: "Brilliant Student" date: "`r format(Sys.Date(), '%Y-%m-%d')`" output: html_document: toc: true highlights: "pygments" --- ```` ??? I am constantly messing up yaml syntax and chasing down mysterious errors. Once you have a header you like, I recommend sticking with it unless you have serious cause to change it. There are almost endless options for customizing the looks of documents. * Some are native to the type of document being rendered (e.g., LaTex-specific options) --- ### Output formats There are [many output formats](https://rmarkdown.rstudio.com/lesson-9.html#documents). Here are a few useful ones (with links to further documentation): .small[ | `output` | Description | |------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------| | [`html_document`](https://bookdown.org/yihui/rmarkdown/html-document.html) | html template with many preset themes | | [`pdf_document`](https://bookdown.org/yihui/rmarkdown/pdf-document.html) | pdf using LaTex | | [`word_document`](https://bookdown.org/yihui/rmarkdown/word-document.html) | MS Word document (can [create custom theme](https://bookdown.org/yihui/rmarkdown-cookbook/word-template.html)) | | [`ioslides_presentation`](https://bookdown.org/yihui/rmarkdown/ioslides-presentation.html) | HTML5 presentation slides | | [`beamer_presentation`](https://bookdown.org/yihui/rmarkdown/beamer-presentation.html) | pdf presentation slides with beamer | | [`powerpoint_presentation`](https://bookdown.org/yihui/rmarkdown/powerpoint-presentation.html) | MS Powerpoint presentation | ] --- ### knitr and pandoc * [`rmarkdown`](https://rmarkdown.rstudio.com/) * an `R` package that converts `.Rmd` to different file types * an `R` interface to `pandoc` * [`pandoc`](https://pandoc.org/) * a general purpose document converter * command line tool, no GUI * `brew install pandoc` * [`knitr`](https://yihui.org/knitr/) * an `R` package that executes code chunks * "knits" results back to document ??? It's useful to understand which program is doing what in case you need to chase down errors. Also, note that you need `pandoc` to render R Markdown. If you have R Studio installed, `pandoc` comes with it. Otherwise, you need to do a separate installation (e.g., using `brew` or `apt`) --- ### Rendering The "easiest" way to render documents is click "Knit" in RStudio. * Don't touch that mouse! We will be rendering documents from the __command line__. ```bash # from directory where example.Rmd file lives Rscript -e "rmarkdown::render('example.Rmd')" ``` See the [`render` documentation](https://rmarkdown.rstudio.com/docs/reference/render.html) for options. ??? Here we are calling `Rscript` (the R executable for scripts) from the command line. This starts a new session of `R` and executes the command `rmarkdown::render('example.Rmd')`, which creates the document `example.html`. The `::` syntax in the `R` command says, "in the `rmarkdown` package use the (exported, see R packages lecture) function `render`. * Allows us to avoid the extra line of code to load the library * e.g., you could also run `Rscript -e "library(rmarkdown); render('example.Rmd')"`, but its more typing. --- ### Breakout exercise 1 Download <a href="https://raw.githubusercontent.com/benkeser/info550/gh-pages/lectures/04_rmarkdown/examples/example.Rmd" download>`example.Rmd`</a>. * Render the document as is from the command line. * Without changing `example.Rmd`, render it as an `ioslides_presentation` that is saved as `example_presentation.html`. * Add __some additional options__ in the yaml header (e.g., change font size, or code highlighting), render the document again and see what changes. * In the __Executive Summary__ section, add inline code that reports the number of rows and columns in the `mtcars` data set. Render to make sure it works. --- ### Code chunk options * Add short __labels__ explaining what chunks do. * Include __options__ to control how code chunks behave. ````markdown Here is some plain text in my document explaining what I did. ```{r, label-of-chunk, echo = FALSE} # this code will not appear in my rendered document # but it is still run my_variable <- rnorm(1) ``` The value of the variable I created is `r my_variable`. ```` This renders as .clearbox[ Here is some plain text in my document explaining what I did. The value of the variable I created is 0.3455167. ] ??? Some notes on chunk labels: * avoid spaces, underscores, and periods * hyphens are safe Options for each code chunk are included, separated by commas. These must all appear on one line. There are __many__ such options that can be included. Here we'll go over just a few. For a full list see [here](https://yihui.org/knitr/options/). --- ### Code chunk options Option values can be __any valid `R` expression__. For example, ````markdown Here is some plain text in my document explaining what I did. ```{r, decide-whether-to-print, echo = FALSE} # define a variable that determines whether I will print # a subsequent code chunk should_i_print <- rnorm(1) > 0 ``` ```{r, maybe-gets-printed, echo = should_i_print} # writing code to print only some of the time doing_some_more_stuff <- "more cool code" ``` ```` --- ### Code chunk options If `should_i_print == TRUE`, this renders as .clearbox[ Here is some plain text in my document explaining what I did. .code-box[ \# writing code to print only some of the time doing_some_more_stuff <- "more cool code" ] ] If `should_i_print == FALSE`, this renders as .clearbox[ Here is some plain text in my document explaining what I did. ] ??? When might you use this? * You are writing a script that analyzes several outcomes. Set an option for a report that allows the user to decide which outcomes they are interested in. * Writing code for outlier diagnosis, *if outliers are present*. If not, do nothing. * ... --- ### Code chunk options Common options for __controlling display__ of results. | Option | Action | |:----------|:----------------------------------------------------| | `eval` | Run the code included in the chunk? | | `echo` | Show the code chunk in the rendered document? | | `warning` | Print warning messages generated by code? | | `error` | Print and keep running after errors? | | `message` | Print messages generated by code? | | `include` | Show the code and results in the rendered document? | --- ### Code chunk options Change __default options__ using code chunk at the top of the `.Rmd`. ````markdown ```{r, setup, include=FALSE} knitr::opts_chunk$set( echo = FALSE, warning = FALSE, fig.width = 6, fig.height = 6 ) ``` ```` --- ### Figures Figures produced by R code will be placed __after the code chunk__ from which they were generated. ````markdown ```{r, make-a-plot} # make up data x <- rnorm(10); y <- rnorm(10) # make a plot plot(x, y) ``` ```` See all [figure options here](https://yihui.org/knitr/options/#plots) ??? The most relevant options are `fig.height`, `fig.width`, and `fig.cap`. --- ### Tables There are several options for __making tables in R Markdown__. * just print a data.frame or matrix * .green[\+] easy, .red[-] monospaced format * [`knitr::kable`](https://bookdown.org/yihui/rmarkdown-cookbook/kable.html) * .green[\+] easy, nice looking, .red[-] not 100% flexible * [`gt`](https://gt.rstudio.com/) * .green[\+] tidyverse compatible syntax, highly flexible, .red[-] only html for now * [`xtable::xtable`](https://www.rdocumentation.org/packages/xtable/versions/1.8-4/topics/xtable) * .green[\+] highly customizable, .red[-] only for pdfs (LaTex based) * [`pander::pander`](https://cran.r-project.org/web/packages/pander/vignettes/knitr.html) * .green[\+] default tables for different `R` classes * [`stargazer::stargazer`](https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf) ??? I find `kable` sufficient for 90\% of my report needs and `xtable` for creating tables for LaTex manuscripts. `gt` looks like it will soon become industry standard, as all things tidyverse tend to do. Think about whether there is a need for really nicely formatted tables or whether raw output is OK. Whose the audience? --- ### Referencing The [`bookdown` package](https://bookdown.org/) allows for __table/figure referencing__. * use e.g., `output: bookdown::html_document2` ````markdown As we see in Figure \@ref(fig:make-a-plot), there is no relationship. ```{r, make-a-plot, echo = FALSE, fig.cap="A great figure I made"} x <- rnorm(10); y <- rnorm(10) plot(x, y) ``` ```` This renders as .clearbox[ As we see in Figure 1, there is no relationship. [figure here] <br> .small[Figure 1: A great figure I made] ] ??? A few notes: * You must have a `fig.cap` for this to work. * For tables they must also include a caption (e.g., `caption` option of `kable`) * If you have a code chunk named `my-chunk` that produces a figure, the reference will be `fig:my-chunk`. * If you have a code chunk named `my-chunk` that produces a table, the reference will be `tab:my-chunk`. * Again, avoid underscores in chunk names! --- ### Rounding Automating __rounding__ can be annoying, .red[but definitely do it]! * `round` drops 0s * `sprintf` is better ```r good_round <- function(x, digits = 2, pval = FALSE){ # just use round otherwise stopifnot(digits > 1) stopifnot(length(digits) == 1) if(pval){ stopifnot(x >= 0 & x <= 1) } tmp <- sprintf(paste0("%.", digits, "f"), x) zero <- paste0("0.", paste0(rep("0", digits), collapse = "")) tmp[tmp == paste0("-", zero)] <- zero if(pval & tmp == zero){ tmp <- paste0("<0.", paste0(rep("0", digits - 1), collapse = ""), "1", collapse = "") } return(tmp) } ``` ??? The `good_round` function solves a couple problems: * `round` will drop trailing zeros, which makes output inconsistent * `sprintf` will report e.g., `-0.00` * both will report p-values of 0, which is generally not good practice --- ### Rounding ```r good_round(215.125, digits = 2) ``` ``` ## [1] "215.12" ``` ```r good_round(0.003154, digits = 2) ``` ``` ## [1] "0.00" ``` ```r good_round(0.003154, digits = 2, pval = TRUE) ``` ``` ## [1] "<0.01" ``` ??? Please adhere to rounding conventions in your reports. It's annoying but your collaborators will thank you. --- ### Reproducible reports * .red[Do not use absolute paths in your code!] * e.g., `~/my_folder/that_you/dont_have/` * Keep all code and data in one directory * e.g., a GitHub repo * If you must use absolute paths, __define variables with directories__ at the top of your document and describe fully in `README`. * e.g., `save_dir = ~/my_folder` * For anything with randomness, `set.seed`! * Include a __final chunk__ that displays the output of `getwd()` and `devtools::session_info()`. * information about your system to help others reproduce * Use `Make` files to help users understand construction * next lecture --- ### Breakout exercise 2 Working again off `example.Rmd`: * add a chunk at the beginning that suppresses all code chunks; * add inline code in Executive Summary to describe conclusion; * add code to chunk `make-scatter` that produces a scatter plot and modify options until it looks good; * add code to chunk `make-table` and modify until it looks good (make sure you round!).