class: left, middle, inverse, title-slide .title[ # Front Matter ] .author[ ### David Benkeser, PhD MPH
Emory University
Department of Biostatistics and Bioinformatics
] .date[ ### INFO550
.white[bit.ly/info550]
] --- <style type="text/css"> .remark-slide-content { font-size: 22px } </style> ## Basics * __Course website__: https://bit.ly/info550 * __Meeting__: Thursdays 1:00 - 2:50pm * __Location__: GCR 119 or [Zoom](https://canvas.emory.edu/courses/94385) * __Attendance policy__: In-person attendance is __not required__ but synchronous attendance in Zoom sessions is __strongly encouraged__. Please read [the full syllabus](https://benkeser.github.io/info550/syllabus). ??? Most material for class will be made available via the course website. The website is hosted on GitHub and all materials are available at the GitHub repository (https://github.com/benkeser/info550). Classes will be a mix of lectures and interactive exercises. Synchronous attendance is strongly encouraged. --- ## Etiquette Please __interrupt me__ to ask questions. * Too hard to pay attention to Zoom window. * I will __try__ to remember to repeat questions. Please feel free to ask __any question__ you like. * Clarification questions are __always welcome__. * Enrichment questions will be answered __time-permitting__. * Come to office hours! I encourage you to leave your camera on during class, but not required. * At least during breakout sessions would be nice. --- ## Office hours __David's office hours__ * Mondays 12-1PM * GCR 322 or Zoom __Sohail's office hours__ * Wednesdays 9-10AM * Zoom? TBD --- ## Learning objectives * Understand why automation is a key element of reproducible data science. * Operate capably and comfortably at the command line. * Implement best practices for version control and open source projects. * Produce reproducible workflows for data cleaning, analysis and report generation using the suite of tools learned in the class. * Create simple R packages. * Build Docker containers and use them to develop containerized workflows. * Understand basic uses of bash and python. * Utilize Jupyter notebooks. * Utilize AWS cloud computing services for computation and storage. --- ## What INFO550 is A class to... * make your science more __reproducible__; * increase scientific impact by making code __accessible__ and __readable__; * __survey tools__ in modern data science; * __make your life easier__. ??? The goal is to teach best practices that we should strive for in our work and to provide motivation and brief introduction to the tools that enable us to make that happen. --- ## What INFO550 is All of the things that I had to teach myself in grad school and beyond that have proven helpful in my research. ??? I am not a true computer scientist (and it's not particularly close!). I don't know everything, and my approach won't always be the most effective. Please offer me suggestions along the way! --- ## What INFO550 is .red[not] A class... * on machine learning; * where the "right" answer is *somewhere* in the slides; * that extensively covers any one area in great depth. ??? We will not discuss statistical methodology, though I may at times show some basic statistics to illustrate software. Learning these tools requires practice and (often a great deal of) frustration -- it's part of the process. Thankfully, there is a large online community that you can access if you know how. Learning to google efficiently is surprisingly important. --- ## The wrong attitude .center[*Other students may be okay with searching the internet for help with R coding, but that's not very helpful for all students.*] .center[\- A disgruntled student] ??? Quote from a review of a previous course. No one writes perfect code. Code breaks. And there's no one in the world who immediately understands every error message to crop up. You must be willing to take the initiative to figure it out. Learning to efficiently fix broken code is almost as important as learning how to write good code in the first place. --- ## Grading This is a __letter graded__ course. * 100% from homework assignments Homeworks will be a mix of project-related assignments, software installations, and other tasks. Participation based on peer-graded breakout groups (next slide). --- ## Breakout groups During class, we will have __breakout groups__ for in-class exercises. * Learning to code alone is frustrating. * Learn from each other. Through a year of hybrid teaching, I have learned that breakout groups are controversial. * Feel free to do breakout exercises on your own. The breakout exercises will often require going beyond what is in the course notes (i.e., searching the internet for a solution). * Try things on your own, but chat with each other about your process. * Share the resources that you find (via Canvas or Zoom chat?). * Share screens/code with class mates. * If you are an advanced user in a topic, take a minute to explain why something works. ## Class composition This class is __very diverse__. * some topics are familiar to many students, but brand new to others; * some topics are new for almost everyone. Bear that in mind and be kind to other students (and your professor). --- ## Project Throughout the semester, you will __develop a project__. * Ideally, something related to your own work. Project must: * involve data __manipulation__ (download, cleaning); * data __analysis__ (summary statistics or figures OK); * use some __downloadable software__ beyond base R (e.g., `R` packages); * __report__ summarizing analysis (including __plot and table__). Homework most weeks will be to __develop an aspect__ of the project. * Often, a peer will grade you on how easy it is to run your code. Data resources * [UCI](https://archive.ics.uci.edu/), [kaggle](https://www.kaggle.com/datasets), [SEER](https://seer.cancer.gov/data/), [Harvard Dataverse](https://dataverse.harvard.edu/) ??? If you do not have access to a real data project at this point, that is OK. Try to think about the types of projects you expect to handle and develop a project that "looks like" that. What types of tables/figures are typically produced in your field? Build an automatic script for making a clean table 1. Build an automatic script for making the results of `glm` look nice. Fancy statistics are not required. An overarching goal of the course is to make you write code that other people (or yourself, at a future date) can use. Thus, you will be graded precisely on how well your code is documented/how easy it is for another person to use it. --- ## Slides The slides are created using the [xaringan](https://slides.yihui.org/xaringan/#1) package. To print the notes as a .pdf (allegedly works best using Chrome): * scroll through all slides so content loads; * File: Print (or `super + p`): Destination: Save as PDF. The slides contain additional "presenter" notes. * Makes the material more self-contained. * Makes sure I do not forget to tell you important things. * To view, enter *Presenter Mode* by pressing `p`. ??? `super` means `Cntrl` on Windows and `Command` on Mac