Reproducibility Is Beneficial
- To us as scientists
- To the scientific community
Five selfish reasons to work reproducibly - Florian Markowetz
- Reproducibility helps to avoid disaster
- Reproducibility makes it easier to write papers
- Reproducibility helps reviewers see it your way
- Reproducibility enables continuity of your work
- Reproducibility helps to build your reputation
Share vs. Publish vs. Archive
Sharing, Publishing, & Archiving are not the same things
Share
SHARE: any way of sharing information – could mean I emailed it to you
Publish
PUBLISH: The data / code are citable & discoverable
Archive
ARCHIVE: Long-term preservation – there is a long-term plan to store (and provide access to) the data / code
Publish & Archive
In this presentation, we’ll focus on publishing & archiving
Common Questions
- Why publish?
- Who are we sharing with?
- What materials do we need to publish?
- When do we make them available?
- Where do we publish various outputs?
- How do we prepare materials for publication?
Let’s assume that we are at the point of submitting our manuscript
Why Publish?
- Increased visibility / citation
- Funding agency / journal requirement
- Community expects results from funded projects
Increased visibility / citation
Requirements Can Vary
- FUNDING AGENCIES:
- JOURNALS:
Why Publish / Share?
- Increased visibility / citation
- Funding agency / journal requirement
- Community expects it
- Better research
- More efficient, less redundant science
- Others can more effectively build upon your work
Why Share? Better Research.
Who do we need to share with?
For research to be reproducible, the research products (data, code) need to be publicly available in a form that people can find and understand.
- Collaborators
- Peer reviewers & journal editors
- Broad scientific community
- General public
What Should We Share?
Consider a recent project or paper of yours:
- Which parts are important to publish?
- Which parts are less important to publish?
- Which parts are too sensitive / cannot be published?
Things We Should Publish
- Starting data set (if it’s not already available)
- Metadata
- Data cleaning steps
- Analysis scripts
- Source code
- Readme
Things We May or May Not Publish
- Raw data: especially if it’s already available
- Processed / cleaned data
- Intermediate results
Things We Shouldn’t Publish
- Confidential (e.g., patient) data
- Material already published
- Pre-existing restrictive license
- Passwords, private keys
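One way to act on the last item is to scan a project for text that looks like a credential before making it public. The following is a minimal sketch, not a replacement for a dedicated secret-scanning tool. The function name find_possible_secrets and the two patterns are assumptions made for illustration; a real project would likely need a longer pattern list.

```python
import re
from pathlib import Path

# Hypothetical, deliberately small set of patterns; extend for real use.
SECRET_PATTERNS = [
    # Header line of a PEM-encoded private key.
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    # Assignments such as "password = ..." or "api_key: ...".
    re.compile(r"(?i)\b(password|passwd|secret|api_key|token)\b\s*[=:]\s*\S+"),
]


def find_possible_secrets(root):
    """Return (path, line_number) pairs for lines that look like credentials."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if any(pattern.search(line) for pattern in SECRET_PATTERNS):
                hits.append((str(path), number))
    return hits
```

Calling find_possible_secrets(".") from the project root lists the files and line numbers to review by hand. A match is only a hint, and an empty result does not guarantee the project is free of sensitive material.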
When Should I Publish?
You can make your code and data public at any point in the research process. By the time you submit a paper, the results should be reproducible and the data & code should be published.
When Should I Publish?
- Journals often require code publication
- Allows editors & reviewers to accurately review methods
- You can often share code privately with reviewers, then release it publicly when the paper is published
Many repositories
Registry of Research data Repositories
Pro-Tip: An archival repository is one that retains your data for a set period of time. Funding agencies often have requirements for how long data must be archived.
How to Choose a Repository
- Is there a domain-specific repository?
- What are the backup & replication policies?
- Is there a plan for long-term preservation?
- Can people find your materials?
- Is it citable (does it provide DOIs)?
- Is your purpose archival, sharing or publication?
Where to Publish Various Parts of a Project
You will likely have different project components:
- R Markdown, Jupyter notebook, etc.
- Source code
- Other documentation
- Raw data
- Derived data
Where to Publish Various Parts of a Project
Example (Python focused) workflow:
- Develop code: GitHub
- Upon Publication:
- Share notebooks on GitHub, Notebook Viewer links
- Archive a snapshot of data in Dryad
- Code snapshot to Zenodo
Resources: Libraries Can Help
Pro-Tip: University and institution libraries often have resources for data management plans, repository access and data archiving.
Ask a Librarian!
A Reproducible Project Should Include
- Top-level README that describes data / software package
- List files & naming conventions
- Describe abbreviations, column names, etc.
- Software installation & usage instructions
- Create separate INSTALL if long
- Citation instructions
- Contribution instructions
- GitHub will automatically link to the CONTRIBUTING file for new issues and pull requests
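The file checklist above can be verified automatically before a release. The following is a minimal sketch; the function names check_project_docs and missing_required are hypothetical, and treating README as required while INSTALL and CONTRIBUTING are optional is an assumption of this example rather than a rule from any repository.

```python
from pathlib import Path

# File stems taken from the checklist; the required/optional split is an
# assumption of this sketch.
REQUIRED = ["README"]
OPTIONAL = ["INSTALL", "CONTRIBUTING"]


def check_project_docs(root):
    """Report which documentation files exist at the top level of a project."""
    # Compare on the stem so README, README.md, and README.txt all count.
    present = {p.stem.upper() for p in Path(root).iterdir() if p.is_file()}
    return {name: name in present for name in REQUIRED + OPTIONAL}


def missing_required(root):
    """Return the required documentation files that were not found."""
    report = check_project_docs(root)
    return [name for name in REQUIRED if not report[name]]
```

Running missing_required(".") in the project root returns an empty list when a README is present. The check only confirms that the files exist; whether they describe file naming, column names, installation, and citation still needs a human read.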
Concerns about publishing data & code
- What are some of the challenges of publishing research products?
- What are some of the concerns that you may have?