Adapted from the Reproducible Science Curriculum
Special Thanks: Francois Michonneau, Hilmar Lapp, Karen Cranston, Jenny Bryan, and everyone else who contributed to these materials.
NEON has adapted these materials for our week-long Data Institute.
Reproducibility is actually all about being as lazy as possible!
– Hadley Wickham (via Twitter, 2015-05-03)
More efficient, less redundant science: others can build upon our work.
Reproducibility spectrum for published research. Source: Peng, R. D. Reproducible Research in Computational Science. Science (2011): 1226–1227, via the Reproducible Science Curriculum.
Five selfish reasons to work reproducibly - Florian Markowetz
Reproducibility helps to avoid disaster
Reproducibility makes it easier to write papers
Reproducibility helps reviewers see it your way
Reproducibility enables continuity of your work
Reproducibility helps to build your reputation
For research to be reproducible, the research products (data, code) need to be publicly available in a form that people can find and understand.
Collaborators
Peer reviewers & journal editors
Broad scientific community
The public
Figure 1. Distribution of reporting errors per paper. Papers whose data were shared had fewer errors.
GitHub: Version Control / Collaboration / Dissemination
R Markdown or Jupyter Notebooks: Code Documentation / Dissemination
Over the next week, we will focus on the tools and skills associated with these facets.
The more self-explanatory the better:
A variable name that describes the object is more useful than a random variable name.
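As a minimal sketch of the point above, compare a generic name with a descriptive one. The measurements and variable names here are invented for illustration and are not taken from the curriculum.

```python
# Hypothetical measurements, invented for illustration only.

# Unclear: the name says nothing about what the numbers represent.
x = [12.1, 15.3, 9.8]

# Clearer: the name states what was measured and in which unit.
tree_heights_m = [12.1, 15.3, 9.8]

# Derived values benefit from the same treatment.
mean_tree_height_m = sum(tree_heights_m) / len(tree_heights_m)
print(mean_tree_height_m)
```

Including the unit in the name (here the `_m` suffix for meters) is one convention among several; the main goal is that a reader can guess the contents without hunting for where the object was created.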
Noble, William Stafford, 2009. A quick guide to organizing computational biology projects.
File Organization should:
File / Folder Names should be:
More on file naming & organization
– from the Reproducible Science Curriculum
Scripting vs. Point and click
Scripting takes more time up front but saves time in the long run.
Time Savings:
DRY – Don’t Repeat Yourself
If your analysis is made up of scripts with the same code repeated throughout, it will be more time-consuming to maintain and update.
Modularity – use functions to write code in reusable chunks
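The DRY and modularity ideas above can be sketched in a few lines of Python. The function name, site names, and temperature values below are hypothetical, chosen only to show one conversion being written once and reused.

```python
def celsius_to_fahrenheit(temp_c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return temp_c * 9 / 5 + 32


# Hypothetical readings from two field sites, invented for illustration.
site_a_temps_c = [10.0, 20.0]
site_b_temps_c = [0.0, 30.0]

# The conversion is defined once and reused, rather than copying the
# formula into every place it is needed.
site_a_temps_f = [celsius_to_fahrenheit(t) for t in site_a_temps_c]
site_b_temps_f = [celsius_to_fahrenheit(t) for t in site_b_temps_c]

print(site_a_temps_f)
print(site_b_temps_f)
```

If the formula ever needs to change, there is a single place to edit, which is the maintenance saving that DRY refers to.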
Document all workflow steps:
Code should be easy to understand with clear goals
Document your code even if you think it’s clear and simple. Your collaborators & your future self will inevitably have an easier time working with it down the road.
Add comments around functions that describe purpose, inputs and outputs.
Avoid proprietary formats: Use text files (.txt, .md) that don’t require special tools to open.
Markdown to style documentation = machine readable, small file size, low overhead.
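The earlier advice to comment functions with their purpose, inputs and outputs might look like the following sketch. The function `normalize` is a hypothetical example, and the docstring layout shown is one common convention, not a format prescribed by these materials.

```python
def normalize(values):
    """Rescale a list of numbers to the range 0 to 1.

    Purpose
    -------
    Put measurements taken on different scales onto a common scale.

    Inputs
    ------
    values : list of float
        At least two numbers that are not all identical.

    Outputs
    -------
    list of float
        The rescaled values, in the same order as the input.
    """
    lowest = min(values)
    highest = max(values)
    if highest == lowest:
        # Rescaling is undefined when every value is the same.
        raise ValueError("values must not all be identical")
    span = highest - lowest
    return [(v - lowest) / span for v in values]


print(normalize([2.0, 4.0, 6.0]))
```

Because the description lives in a docstring, calling `help(normalize)` in a Python session displays it, so the documentation travels with the code.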
Use coding approaches that connect data cleaning, analysis & results
R Markdown and Jupyter notebooks allow you to publish code and results in one (or more) output files.
Publishing is not the end of your analysis; rather, it is a step toward your future research and the future research of others.
Example Workflow / Tools:
Document workflow: R Markdown / Jupyter Notebooks
Collaborate with Colleagues / Version Control: GitHub
Publish Data Snapshot: FigShare, Dryad, Zenodo, etc.
Share workflow: Notebook Viewer, Binder
An overview of some of the topics, tools and skills that we will cover during the Data Institute
Documentation: Jupyter notebooks, GitHub
Organization: File naming / directory structure best practices
Automation: Efficient Coding Practices (in Python)
Dissemination: GitHub
Email: neondataskills@BattelleEcology.org