class: left, middle, inverse, title-slide

.title[
# Project organization
]

.author[
### David Benkeser, PhD MPH
Emory University
Department of Biostatistics and Bioinformatics
]

.date[
### INFO550
.white[bit.ly/info550]
.white[Additional reading]
]

---
background-color: #007dba
class: title-slide, center, inverse, middle

*File organization and naming are powerful weapons against chaos.*

<div>

[Jenny Bryan](https://jennybryan.org/)

???

Project organization may seem a rather dull/pedantic topic for a data science class, but it is actually very important!

<style type="text/css">
.remark-slide-content {
  font-size: 25px
}
</style>

---

### Basic principles

* .red[Put everything in one version-controlled directory.]
* Develop your own system.
  * Be consistent, but look for ways to improve.
  * naming conventions, file structure, `make` structure
* Raw data are sacred. Keep them separate from everything else.
* Separate code and data.
* Use `make` files and/or READMEs to document dependencies.
* No spaces in file names.
* Use meaningful file names.
* Use YYYY-MM-DD date formatting.
* .red[No absolute paths.]
* .red[Use a package management system.]

???

There is not one correct way to organize a project, but there are many incorrect ways. Over time you will develop your own system for doing so. We'll discuss a few frameworks that may be helpful.

The biggest thing is to commit to using the same structure for as long as it's working for you.

As you are working on many projects, be self-aware during the process. Are you consistently losing track of certain types of files? Are there gaps in the workflow that make things hard to reproduce? What can you do to fill in the gaps?

Organization might not seem like a big deal now, but remember your career is just getting started. Expect to work on 2-20 different projects a year for the rest of your working life. That number of projects is staggering! You need a way to keep track.

---
background-color: #84754e
class: title-slide, center, inverse, middle

*You mostly collaborate with yourself, and me-from-two-months-ago never responds to email.*

<div>

[Karen Cranston](http://kcranston.github.io/)

???

This is as much about preserving your own sanity as it is about being reproducible (though it is about that too).

There's a (big) upfront cost here, but in the end it is worth it.

---

### What to organize?

It is probably useful to have a system for organizing:

* data analysis projects;
* first-author papers;
* talks.

The systems should adhere to the same general principles, but different requirements may necessitate different structures.

.red[Think about organization of a project from the outset!]

???

Thinking about project organization from the outset is the most important thing. Map out (e.g., as comments in a README) how you see the project developing. Even if things change over time, it's good to have a structure in place from the beginning.

This pertains to project directories in particular, but also to your hard drive in general. I made the mistake of organizing projects by institution (UW, Berkeley, Emory), and am now regretting that. What about projects that weren't finished when I got to Emory? What about projects that I continued developing once I got here? It's a mess. Don't be like me.

---

### Collaborative projects

Collaborative projects present a greater challenge.

* Not everyone is comfortable with LaTeX or git or ...

I don't have a great solution for this.

* Google Drive/Word Online helps to a certain extent, but you lose in other areas (reference management, math typesetting).
* [Overleaf](http://overleaf.com/) has gotten much better for LaTeX.

Some advice:

* Address organization from the outset.
* Ideally, bring people on board to your (version controlled, reproducible) system.
* Keep open lines of communication (especially if using GitHub).

???

There's almost always some pain associated with working in close collaboration with someone else on a project. The most important thing is to commit to being organized from the outset, and to agree on what that means.

Even if some elements of the project are outside your control, you can try to bring elements into your workflow.

* E.g., if you receive comments with tracked changes from a colleague, incorporate them into the document and add a commit message describing whose edits they were.

If working on a shared GitHub repo, keep open lines of communication, e.g., short emails (or [slack](https://slack.com/) messages, etc...), "Just pushed `x`..."

---

### Example data analysis project

.code-box[
analysis/ <br>
raw_data/ <br>
data/ <br>
R/ <br>
R/00_clean_data.R <br>
R/01_fit_models.R <br>
R/02_make_figures.R <br>
R/03_summarize_results.R <br>
R/04_report.Rmd <br>
figs/ <br>
sandbox/ <br>
sandbox/exploratory.R <br>
ref_papers/ <br>
Makefile <br>
README.md <br>
renv
]

???

Some notes:

* separate raw data and processed data
* separate folder for figures (could possibly move the R code for figures there)
* `sandbox` is where I like to keep messy stuff that I'm trying out but that should never see the light of day
* informative file names for `R` scripts, broken down into logical steps of a workflow
* `.gitignore` would include
  * definitely: `figs/*`, `ref_papers/*`
  * possibly: `raw_data`/`data` (if sensitive), `sandbox` (if informal)

The `00`, `01`, ... number system is something that many people use because it helps create a sortable file system. It generally works, but some workflows don't really follow this sort of convention logically (e.g., things can happen in parallel).

---

### Example paper organization

.code-box[
paper/ <br>
analysis/ <br>
analysis/README.md <br>
analysis/00_clean_data.R <br>
analysis/01_fit_models.R <br>
analysis/02_make_figures.R <br>
analysis/sandbox <br>
sim/ <br>
sim/README.md <br>
sim/helper_functions.R <br>
sim/sim_script.R <br>
sim/run_sim_script.sh <br>
sim/sandbox <br>
figs/ <br>
notes/ <br>
ref_papers/ <br>
submitted/ <br>
revision/ <br>
final/ <br>
README.md <br>
Makefile <br>
my_paper.tex <br>
my_refs.bib
]

???

Biostat methods papers often have three components: math, simulation, data analysis (prove your method works mathematically, show that it works on fake data, show it can be used on real data). Often these papers must be prepared using LaTeX. Here's an example directory for such a paper.

Some notes:

* separate READMEs for the main project, `/analysis`, `/sim`
* if you submit to arXiv, medRxiv, etc..., include that link somewhere you can easily find it, e.g., in the main README
* `/notes` contains random musings ("what would happen if...") and un-prettified math (e.g., pictures of my whiteboard taken with a cell phone -- very high tech); it could also include key emails from collaborators (easier to find while working on the project)
* `submitted` might include the compiled pdf that was submitted to the journal and a README that indicates which commit contained the submitted version of the repo. Similar story for `revision` and `final`.
* LaTeX produces lots of ugly files when it compiles documents (`.aux`, `.bbl`); have your `make paper` rule remove those at the end.

---

### Example talk organization

.code-box[
jsm2020_talk/ <br>
R/ <br>
figs/ <br>
README.md <br>
Makefile <br>
my_talk.Rmd <br>
my_refs.bib
]

???

Similar story here as with a paper.
I have primarily used beamer (LaTeX) for talks in the past, but after this course, I may make the move to xaringan slides.

Depending on how often you give a talk, you might consider:

* naming the folder after the talk subject rather than the conference/talk location
* including an `old` folder that contains past versions of the talk given at previous locations (or you could just tag the GitHub repo at certain points)

---

### Example class organization

See the [class GitHub repository](https://github.com/benkeser/info550)!

---

### Organizing data

Raw data are sacred... but may be a mess.

* You'll be surprised (and disheartened) by how many color-coded Excel sheets you'll get in your life.

It is tempting to edit raw data by hand. .red[Don't!]

* Everything scripted!

Use meta-data files to describe raw and cleaned data.

* structure the meta-data as data (e.g., a `.csv`), so it is easy to read

???

The generation of the raw data may be the one thing out of your control in an analysis. But from the time the data are passed on to you, everything that happens should be reproducible.

This can be painful. For many projects, 90% of your time might be devoted to wrangling raw data into a format that is usable. Save yourself from the danger of having to re-do all those painful bits when (not if) the data change.

---

### Organizing data

Hadley Wickham [defined the notion of tidy data](https://doi.org/10.18637/jss.v059.i10).

* Each variable forms a column.
* Each observation forms a row.
* Each observational unit forms a table.

.code-box[
| ptid | day | age | drug | out |
|------|-----|-----|------|-----|
| 1 | 1 | 28 | 0 | 0 |
| 1 | 2 | 28 | 0 | 1 |
| 2 | 1 | 65 | 0 | 0 |
| 2 | 2 | 65 | 1 | 1 |
| 3 | 1 | 34 | 0 | 0 |
| 3 | 2 | 34 | - | 1 |
]

???

I am asking you to read this paper for homework. It's a worthwhile read. Hadley's packages like `dplyr` are very useful if you want to take the time to learn them. I probably should do that myself one day...

---

### Exploring data

One of the first things we'll often do is open the data and start poking around.

* Could be informal, "getting to know you."
* Could be more formal, "see if anything looks interesting."

This is often done in an ad-hoc way:

* entering commands directly into R;
* making and saving plots "by hand";
* etc...

.red[Slow down and document.]

* Your future self will thank you!

???

You want to avoid situations like:

* needing to recreate a plot that you made "by hand" and saved "by hand";
* figuring out why you removed certain observations;
* trying to remember what variables had an interesting relationship that you wanted to follow up on later.

---

### Exploring data

Write out a set of comments describing what you are trying to accomplish and fill in code from there.

* I do this for every coding project.
* Data analysis, methods coding, package development.

Leave a searchable comment tag by code you want to return to later.

* E.g., I use `# TO DO: add math expression to labels; make colors prettier`.

This sets "the bones" of a formal analysis in place while allowing for some creative flow.

???

From the outset, stop and think about what you want to do. Start filling in details from there. That simple approach will increase efficiency and reproducibility.

---

### Exploring data

Other helpful ideas for formalizing exploratory data analysis:

* `.Rhistory` files
  * all the commands used in an R session
* Informal `.Rmd` documents
  * an easy way to organize code/comments into a readable format
* `save` intermediate objects and workspaces
  * and document what they contain! (a minimal sketch follows)
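For example, here is a minimal sketch of that last idea. The object, script, and file names are hypothetical, and it assumes you run `R` from the root of the example analysis directory shown earlier (with an `output/` folder added).

```r
# e.g., near the end of R/01_fit_models.R (hypothetical script name)

# placeholder model fits standing in for a real analysis
fit_unadjusted <- lm(mpg ~ wt, data = mtcars)
fit_adjusted <- lm(mpg ~ wt + hp, data = mtcars)

# bundle intermediate results and document what the saved file contains:
# output/model_fits.RData holds a named list of lm fits made by R/01_fit_models.R
model_fits <- list(unadjusted = fit_unadjusted, adjusted = fit_adjusted)
dir.create("output", showWarnings = FALSE)
save(model_fits, file = "output/model_fits.RData")
```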
* `knitr::spin`
  * writing `.R` scripts with comments that can be rendered into a report

---

### The `here` package

.red[No absolute paths.]

* Absolute paths are the enemy of project reproducibility.

For `R` projects, the [`here`](https://here.r-lib.org/) package provides a simple way to use relative file paths.

* Read [Jenny Bryan and James Hester's chapter](https://rstats.wtf/project-oriented-workflow.html) on project-oriented workflows.

The use of `here` is dead simple and best illustrated by example.

---

### The `here` package

Consider this simple project structure.

.code-box[
my_project/ <br>
data/ <br>
my_data.csv <br>
output/ <br>
R/ <br>
R/my_analysis.R <br>
Rmd/ <br>
Rmd/my_report.Rmd <br>
]

Here, the folder `my_project` is the __root directory__.

* Where `.git` lives
* All file paths should be .green[relative] to `my_project`!

---

### The `here` package

Each `R` script or `Rmd` report should contain a call to `here::i_am('path/to/this/file')` at the top.

* `path/to/this/file` should be replaced with the path .green[relative] to the project's __root directory__.
* `here::i_am` means "use the function `i_am` from the `here` package."

For example, the file `R/my_analysis.R` might look like this.

````
# include at top of script
here::i_am('R/my_analysis.R')

# now add all your great R code...
````

---

### The `here` package

`Rmd/my_report.Rmd` should include an `R` chunk that calls `i_am`.

````
---
output: html_document
---

```{r}
here::i_am('Rmd/my_report.Rmd')
```

<!-- Now the rest of your Rmd code -->
````

---

### The `here` package

The call to `i_am` establishes the __root directory__.

* Subsequent file paths can be made using the `here` function.

For example, `my_analysis.R` might look like this.

````
# include at top of script
here::i_am('R/my_analysis.R')

# load data
my_data <- read.csv(here::here('data', 'my_data.csv'))

# do some analysis to get results

# save results
save(my_results, file = here::here('output', 'my_results.RData'))
````

???

You can check what `here` thinks the root directory is by printing the output of `here::here()`.

---

### The `renv` package

.red[Use a package management system.]

To increase reproducibility of a project, we must keep track of what packages are used.

* Want to avoid chasing down 100 errors like this:

.code-box[
.red[Error in library(ggplot2) : there is no package called ‘ggplot2’] <br>
]

The [`renv` package](https://rstudio.github.io/renv/articles/renv.html) is useful to this end.

---

### The `renv` package

Download the <a href='example_project.zip' download>`example_project`</a> folder.

.code-box[
example_project/ <br>
Makefile <br>
figs/ <br>
R/ <br>
R/barchart.R <br>
Rmd/ <br>
Rmd/report.Rmd <br>
]

If you scan the code, you will see we need __four R packages__.

* `here`, `wesanderson`, `knitr`, and `rmarkdown`

---

### The `renv` package

Open an `R` session from the `example_project` folder.

* Install the `renv` package if needed.

Initialize the project by running the following command.

```r
renv::init()
```

You will see a lot of output. What just happened??

---

### `renv.lock` file

You should now see a __lockfile__, `example_project/renv.lock`.

````
{
  "R": {
    "Version": "4.0.3",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cloud.r-project.org"
      }
    ]
  },
  "Packages": {
    "base64enc": {
      "Package": "base64enc",
      "Version": "0.1-3",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "543776ae6848fde2f48ff3816d0628bc"
    },
    ...
````

---

### `renv.lock` file

This __lockfile__ records all of the information about packages needed by your project.
* Version of each package, where it was installed from, etc...

How does it know?

* `renv` scans all files in your project directory.
* It looks for `library`, `require`, or `package::function`.

---

### `.Rprofile`

We also see `example_project/.Rprofile`, containing one line.

```r
source("renv/activate.R")
```

We have not talked about `.Rprofile` files because they are generally antithetical to reproducible research. When `R` starts, it searches for an `.Rprofile` and runs what it finds.

* Use it to change various options, always load certain packages, etc...

In this case, whenever we start `R` in `example_project`, we __activate__ our `R` environment.

* This tells `R` where packages for this project are saved, etc...
* Details are not too important.

---

### `renv` folder

You will also see a folder `example_project/renv/`.

* This folder contains your __project library__.

`renv` tries to be clever about installing packages.

* Already have a package installed elsewhere? `renv` will link to it.
* Otherwise, the package is installed in `renv/library`.

Note that `renv/.gitignore` ensures packages are not put under version control.

---

### Activating `renv`

`renv` will automatically be active in any `R` session that is run from the `example_project` directory.

* Recall the presence and function of `.Rprofile`.

To activate it in an `R` session run from elsewhere: `renv::activate('path/to/renv')`.

For using `renv` with R Markdown projects, this is important!

* Either `activate` in a code chunk,
* or use `knitr::opts_knit$set(root.dir = here::here())` (assuming the project root contains your `renv` project library).

---

### Collaborating with `renv`

A typical `renv` collaborative workflow on GitHub:

* .usera[User A] initializes the lockfile using `renv::init()`.
* .usera[User A] commits `renv.lock`, `.Rprofile`, and `renv/activate.R` and pushes to GitHub.
* .userb[User B] pulls from GitHub, opens `R`, and uses `renv::restore()` to synchronize their local project directory.
* .userb[User B] adds new packages to the code and uses `renv::snapshot()` to record the changes in `renv.lock`.
* .userb[User B] commits `renv.lock` and pushes to GitHub.
* .usera[User A] pulls from GitHub, opens `R`, and uses `renv::restore()` to synchronize their local project directory.
* ...

A compact `R`-only sketch of this sequence is shown on the next slide.

---

### Breakout exercise

With a partner, choose .usera[User A] and .userb[User B].
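The tasks on the following slides walk through the collaborative `renv` workflow from the previous slide. As a rough reference (showing only the `R`-side commands; the Git steps happen in the shell), the sequence looks roughly like this:

```r
## User A: create the project library and lockfile, then (in the shell) commit
## renv.lock, .Rprofile, and renv/activate.R and push to GitHub
renv::init()

## User B: after cloning the repository, install the locked package versions
renv::restore()

## User B: after editing the code to use a new package, check and record it
renv::status()
renv::snapshot()

## User A: after pulling User B's updated renv.lock, synchronize again
renv::restore()
```

---

### Breakout exercise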
.usera[User A]

* initialize the lockfile for `example_project`
  * open `R` in `example_project` and run `renv::init()`
* use `git init` to initialize version control of `example_project`
* create a GitHub repository for `example_project`
* add the GitHub repository as a remote to your local repository (HTTPS or SSH)
  * .small[`git remote add origin https://github.com/usera/example_project.git`], or
  * .small[`git remote add origin git@github.com:usera/example_project.git`]
* commit all files locally and push to the repository

---

### Breakout exercise

.userb[User B]

* create a fork of .usera[User A]'s repository
* `git clone` the fork to your local machine
* `cd` into `example_project` and open `R`
* run `renv::restore()` to synchronize your package library
* confirm that you can build the report
  * run `make report` from the `example_project` directory
  * open `Rmd/report.html` to confirm a correct build

---

### Breakout exercise

.userb[User B] now wants to change the colors of the graph

* open `R` in the `example_project` directory
* run `renv::remove('wesanderson')` to remove the `wesanderson` package from the project library
* replace lines 4-5 of `barchart.R` with the following and save

```r
library(RColorBrewer)
colors <- brewer.pal(3, "Dark2")
```

* open `R` in `example_project` and run `renv::status()`
  * if prompted, run `install.packages("RColorBrewer")`
* run `renv::snapshot()` to add `RColorBrewer` to the lockfile
* commit changes and push to your fork
* submit a PR to .usera[User A]'s repository

---

### Breakout exercise

.usera[User A]

* add a remote linking to .userb[User B]'s repository
  * .small[`git remote add userb https://github.com/userb/example_project.git`]
* `fetch` .userb[User B]'s `master` branch
  * .small[`git fetch userb master`]
* `checkout` .userb[User B]'s `master` branch
  * .small[`git checkout remotes/userb/master`]

---

### Breakout exercise

.usera[User A]

* `cd` into `example_project` and open `R`
* run `renv::restore()` to synchronize your package library
* confirm that you can build the report
  * run `make report` from the `example_project` directory
  * open `Rmd/report.html` to confirm a correct build

---

### Breakout exercise

.usera[User A]

* if the report builds correctly, create a local branch named `userb` from .userb[User B]'s `master` branch
  * .small[`git checkout -b userb`]
* `merge` .userb[User B]'s `master` into your `master`
  * .small[`git checkout master`]
  * .small[`git merge userb`]
* close the PR by pushing to GitHub
  * .small[`git push origin master`]

---

### Breakout exercise

.usera[User A] has changed their mind about the colors!
* open `R` in the `example_project` directory
* run `renv::remove('RColorBrewer')` to remove `RColorBrewer` from the project library
* replace lines 4-5 of `barchart.R` with the following and save

```r
colors <- c("red", "blue", "green")
```

* open `R` and run `renv::status()` and `renv::snapshot()`
* confirm that you can build the report
  * run `make report` from the `example_project` directory
  * open `Rmd/report.html` to confirm a correct build
* if the report builds correctly, `commit` and `push`

---

### Breakout exercise

.userb[User B]

* add a remote linking to .usera[User A]'s repository
  * .small[`git remote add usera https://github.com/usera/example_project.git`]
* `fetch` .usera[User A]'s `master` branch
  * .small[`git fetch usera master`]
* `checkout` .usera[User A]'s `master` branch
  * .small[`git checkout remotes/usera/master`]

---

### Breakout exercise

.userb[User B]

* `cd` into `example_project` and open `R`
* run `renv::status()` and, if needed, `renv::restore()` to synchronize your package library
* confirm that you can build the report
  * run `make report` from the `example_project` directory
  * open `Rmd/report.html` to confirm a correct build

---

### Breakout exercise

.userb[User B]

* if the report builds correctly, create a local branch named `usera` from .usera[User A]'s `master` branch
  * .small[`git checkout -b usera`]
* `merge` .usera[User A]'s `master` into your `master`
  * .small[`git checkout master`]
  * .small[`git merge usera`]
* update your fork of the repository by pushing to GitHub
  * .small[`git push origin master`]
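---

### Aside: a sketch of `barchart.R`

For reference, here is a hypothetical sketch of what `R/barchart.R` could look like, consistent with the exercise above. The actual script in `example_project` may differ; the data, figure name, and palette choice below are made up, and `here` and `wesanderson` are assumed to be installed.

```r
# hypothetical sketch of R/barchart.R -- the real script may differ
here::i_am("R/barchart.R")

library(wesanderson)                 # the color definition ("lines 4-5") that
colors <- wes_palette("Zissou1", 3)  # the breakout exercise asks you to swap out

# a simple bar chart saved to figs/ for use in Rmd/report.Rmd
counts <- table(mtcars$cyl)
png(here::here("figs", "barchart.png"))
barplot(counts, col = colors, main = "Cars by number of cylinders")
dev.off()
```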