Share, publish, & archive research products

authors: Reproducible Science Curriculum contributors, modified by NEON staff

Reproducibility Is Beneficial

  • To us as scientists
  • To the scientific community

Five selfish reasons to work reproducibly - Florian Markowetz

  1. Reproducibility helps to avoid disaster
  2. Reproducibility makes it easier to write papers
  3. Reproducibility helps reviewers see it your way
  4. Reproducibility enables continuity of your work
  5. Reproducibility helps to build your reputation

Share vs. Publish vs. Archive

Sharing, Publishing, & Archiving are not the same things

Share

SHARE: any way of sharing information – could mean I emailed it to you

Publish

PUBLISH: The data / code are citable & discoverable

Archive

ARCHIVE: Long-term preservation – there is a long-term plan to store (and provide access to) the data / code

Publish & Archive

In this presentation, we’ll focus on publishing & archiving

Common Questions

  • Why publish?
  • Who are we sharing with?
  • What materials do we need to publish?
  • When do we make them available?
  • Where do we publish various outputs?
  • How do we prepare materials for publication?

Let’s assume that we are at the point of submitting our manuscript

Why Publish?

  • Increased visibility / citation
  • Funding agency / journal requirement
  • Community expects results from funded projects

Increased visibility / citation

Figure 1: Citation density for papers with and without publicly available microarray data, by year of study publication. Source: Piwowar & Vision. 2013. Data reuse and the open data citation advantage

Requirements Can Vary

Why Publish / Share?

  • Increased visibility / citation
  • Funding agency / journal requirement
  • Community expects it
  • Better research
  • More efficient, less redundant science
    • Others can more effectively build upon your work

Why Share? Better Research.

Figure 2: Distribution of reporting errors per paper, for papers from which data were shared and from which no data were shared.

Source: Wicherts et al. 2011. Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results.

Who do we need to share with?

For research to be reproducible, the research products (data, code) need to be publicly available in a form that people can find and understand.

  • Collaborators
  • Peer reviewers & journal editors
  • Broad scientific community
  • General public

What Should We Share?

Consider a recent project or paper of yours:

  • Which parts are important to publish?
  • Which parts are less important to publish?
  • Which parts are too sensitive / cannot be published?

Things We Should Publish

  • Starting data set (if it’s not already available)
  • Metadata
  • Data cleaning steps
  • Analysis scripts
  • Source code
  • Readme

Things We May or May Not Publish

  • Raw data: especially if it’s already available
  • Processed / cleaned data
  • Intermediate results

Things We Shouldn’t Publish

  • Confidential (e.g., patient) data
  • Material already published
  • Pre-existing restrictive license
  • Passwords, private keys
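
Before publishing a code repository, it can help to scan it for credentials that slipped into scripts or config files. The following is a minimal sketch, not a complete secret scanner: the function name `find_possible_secrets` and the two patterns are illustrative assumptions and will miss many real cases.

```python
import re
from pathlib import Path

# Hypothetical patterns for illustration; tune them to your own project.
SECRET_PATTERNS = [
    # Header line of a PEM-encoded private key
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    # Assignments such as "password = ..." or "api_key: ..."
    re.compile(r"(?i)\b(password|passwd|secret|api_key|token)\b\s*[:=]\s*\S+"),
]


def find_possible_secrets(project_dir):
    """Return (path, line_number) pairs for lines that look like credentials."""
    hits = []
    for path in sorted(Path(project_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if any(pattern.search(line) for pattern in SECRET_PATTERNS):
                hits.append((str(path), number))
    return hits
```

Any hit should be reviewed by hand; a match is only a hint, and a clean result does not prove that the project is free of sensitive material.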

How To Decide What to Publish

Pro-Tip: Re-run your analyses. Make all of the data, code & notes needed to run your analysis available.

Computing Workflows for Biologists: A Roadmap
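
One way to act on the re-run tip is to copy the project into an empty directory and run the analysis there, so anything that was left out of the project folder shows up as a failure. This is a minimal sketch under assumptions: the function name `rerun_in_clean_copy` is hypothetical, and it assumes the analysis starts from a single Python entry script that uses paths relative to the project root.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def rerun_in_clean_copy(project_dir, entry_script):
    """Copy the project to a scratch directory, run the entry script there,
    and return the script's exit code (0 means it ran to completion)."""
    with tempfile.TemporaryDirectory() as scratch:
        copy = Path(scratch) / "project"
        shutil.copytree(project_dir, copy)
        # Run with the same Python interpreter, from inside the copy,
        # so relative paths resolve against the copied files only.
        result = subprocess.run([sys.executable, entry_script], cwd=copy)
        return result.returncode
```

A non-zero exit code suggests that some data file, script, or note needed for the analysis lives outside the folder you plan to publish. This check does not isolate installed packages, so software dependencies still need to be documented separately.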

When Should I Publish?

You can make your code and data public at any point of the research process. When you submit a paper, results should be reproducible and data & code should be published.

When Should I Publish?

  • Journals often require code publication
    • Allows editors & reviewers to accurately review methods
    • You can often share code with reviewers only, then release it publicly when the paper is published

Where Should I Publish?

Many repositories

Registry of Research Data Repositories

Pro-Tip: An archival repository is one that retains your data for a set period of time. Funding agencies often have requirements for how long data must be archived.

How to Choose a Repository

  • Is there a domain specific repository?
  • What are the backup & replication policies?
  • Is there a plan for long-term preservation?
  • Can people find your materials?
  • Is it citable? (Does it provide DOIs?)
  • Is your purpose archival, sharing or publication?

Where to Publish Various Parts of a Project

You will likely have different project components:

  • R Markdown, Jupyter notebook, etc.
  • Source code
  • Other documentation
  • Raw data
  • Derived data

Where to Publish Various Parts of a Project

Example (Python focused) workflow:

  • Develop code: GitHub
  • Upon Publication:
    • Share notebooks on GitHub, Notebook Viewer links
    • Archive a snapshot of data in Dryad
    • Code snapshot to Zenodo

Resources: Libraries Can Help

Pro-Tip: University and institution libraries often have resources for data management plans, repository access and data archiving.

Ask a Librarian!

How to share & publish: standard data formats

Pro-Tip: Using standard data formats increases opportunities for re-use and expansion of your research.

Document Format Considerations

Do Use:

  • Non-proprietary file formats
  • Text file formats (.csv, .tsv, .md, .txt)

Don’t Use:

  • Proprietary file formats (.xls)
  • Numeric data in PDFs or images (please, please don’t!)
  • Data in Word documents
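
As an illustration of the "Do Use" advice, tabular data can be written to a plain-text CSV file with Python's standard library alone. This is a minimal sketch; the helper names `write_csv` and `read_csv` and the column names in the usage note are made up for the example.

```python
import csv


def write_csv(path, header, rows):
    """Write a header line and data rows to a UTF-8 encoded CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv(path):
    """Read a CSV file back as a list of dictionaries keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `write_csv("temps.csv", ["site_id", "date", "temp_c"], [["A1", "2016-05-01", 12.5]])` produces a file that any text editor or statistics package can open. Note that values come back as strings when read, so column types and units still belong in the metadata.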

A Reproducible Project Should Include

  • Top-level README that describes data / software package
  • List files & naming conventions
  • Describe abbreviations, column names, etc.
  • Software installation & usage instructions
    • Create separate INSTALL if long
  • Citation instructions
  • Contribution instructions
    • GitHub will automatically link to the CONTRIBUTING file for new issues and pull requests
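
A quick check for the top-level files listed above can be scripted. The sketch below is an assumption-laden illustration: the function name `missing_top_level_files` is hypothetical, and the default list (README and CONTRIBUTING) is only one reasonable choice. It matches file names without regard to case or extension, so `README.md` and `readme.txt` both count.

```python
from pathlib import Path


def missing_top_level_files(project_dir, required=("README", "CONTRIBUTING")):
    """Return the names from `required` that have no matching file
    in the top level of `project_dir` (case and extension are ignored)."""
    present = {
        path.stem.upper()
        for path in Path(project_dir).iterdir()
        if path.is_file()
    }
    return [name for name in required if name.upper() not in present]
```

An empty list means every expected file exists; it says nothing about whether the README actually documents the files, naming conventions, and column names, which still needs a human read-through.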

Concerns about publishing data & code

  • What are some of the challenges of publishing research products?
  • What are some of the concerns that you may have?