Last week, I got to attend a series of presentations followed by a panel discussion on open science at Penn's Van Pelt-Dietrich Library during Open Access Week. The panel featured (from left to right) Ted Satterthwaite, Jennifer Stiso, and Daniel Himmelstein, each of whom initiated enlightening discussions around a different aspect of open science.

Photo credit: Rebecca Miller

Jennifer Stiso motivated open science best practices by recapitulating the reproducibility crisis. Stiso referenced Tal Yarkoni's blog post "I hate open science" to remark on the different values that the open science community holds near and dear: Yarkoni pointed out potential internal conflicts among those values when we treat open science as a movement or an identity, mostly due to the community's heterogeneity. Perhaps I don't see enough tweets with ongoing arguments about open science to understand Yarkoni's frustration. However, I think open science is a useful umbrella term even when we focus not on all of these values but only on a few, such as reproducibility, accessibility, and metascience.

Potential solutions to the [reproducibility problems] now fit under a big umbrella called “open science”

  • Reproducibility (reporting clarity, appropriate statistics)
  • Accessibility (preprints, open access journals)
  • Incentive Alignment (publishing null results)
  • Diversity (outreach)
  • Metascience (reporting clarity)

Some of these initiatives are in conflict with incentive structures inherent to science, making their widespread adoption difficult (but that doesn’t mean we shouldn’t try)

Stiso unbundled this definition into smaller chunks and shared resources that tackle each chunk, helping make science more open while facing minimal downsides. Details on these techniques and resources can be found in her slides, linked at the end of this post along with other relevant links. Two resources I want to highlight:

Complementing these techniques, Ted Satterthwaite provided an impressively concise yet comprehensive list of computational tools for reproducible science:

  • Pre-registration of hypotheses (OSF preregistration)
  • Standardized naming and directory structures (Brain Imaging Data Structure)
  • Version controlled code repositories (GitHub, Bitbucket)
  • Analytic notebooks (Jupyter notebook, Rmarkdown)
  • Continuous integration testing of software (CircleCI; a small test sketch follows this list)
  • Containerized pipelines (Docker)
  • Clear, human-readable wikis (README.md)
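
To make the continuous integration item a bit more concrete, here is a minimal sketch of the kind of automated check a service like CircleCI could run on every push. The `zscore` function, its expected values, and the use of pytest are my own illustrative choices, not something presented in the talk.

```python
# A minimal sketch of the kind of automated check a CI service (e.g. CircleCI)
# could run on every commit. The analysis function and expected values below
# are hypothetical placeholders, not something from the talk.
import numpy as np
import pytest


def zscore(values):
    """Standardize an array to zero mean and unit variance."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def test_zscore_is_standardized():
    # After standardizing, the mean should be ~0 and the standard deviation ~1.
    scores = zscore([1.0, 2.0, 3.0, 4.0])
    assert scores.mean() == pytest.approx(0.0)
    assert scores.std() == pytest.approx(1.0)
```

Running `pytest` locally executes the same check; the CI service simply automates it on every change.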

Satterthwaite also suggested an interesting model in which a "replication buddy" helps double-check whether our analyses are reproducible, from walking through our analytic notebooks to reviewing code and wikis.

Himmelstein delivered a lively talk on the history of Manubot – software that supports open, collaborative writing – and how it was inspired by the massively open online paper (MOOP) codenamed the Deep Review. Some GitHub users actually came to that repository's issues section to request source code and datasets for the works referenced in the review paper, which showed how important prompt responses can be. He also shared his frustration with the legal barriers he encountered when reusing data to build Hetionet, a network of biomedical knowledge, and how that motivated him to open source his products and encourage others to do the same.

I loved how interactive and intimate the discussion was, and the audience raised very relevant and important questions. We requested more details on many aspects; I only took notes on some of them here:

  • What is continuous integration? In this context, continuous integration is a way to automatically build the changes we make into the current state of the product. It is commonly used to run unit tests automatically. In Manubot, continuous integration builds the living manuscript almost immediately after edits are merged.
  • Should I be worried that a journal won't consider my article because it has a preprint? Most journals accept manuscripts that have previously been deposited to a preprint server. We can double-check the list of academic journals by preprint policy.
  • What if I do not have permission to make the data I'm analyzing public? The philosophy here is to be "as open as possible and as closed as necessary". Whether that involves publishing derived variables, sharing checksums of the data (a small sketch follows this list), or simply being transparent with the code, do the best you can.
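
On the checksum suggestion, here is a minimal sketch of how one might publish a fingerprint of a restricted dataset: a SHA-256 digest lets readers verify that they are working with exactly the same file without the file itself being shared. The file name below is a hypothetical placeholder.

```python
# Minimal sketch: compute a SHA-256 checksum of a (hypothetical) restricted
# data file. Publishing the digest lets others verify that they have exactly
# the same file, without the file itself ever being made public.
import hashlib
from pathlib import Path


def sha256_checksum(path, chunk_size=8192):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    data_file = Path("restricted_clinical_data.csv")  # hypothetical file name
    print(f"{data_file.name}: {sha256_checksum(data_file)}")
```

The digest can then be reported in the paper or repository README, so anyone who later obtains the data through the proper channels can confirm it matches what was analyzed.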

I myself asked a question about the challenges of open peer review, and Himmelstein suggested that an incentive for more orthogonal (and correct) opinions may help alleviate the information cascade effect. After thinking about this a bit more, I'm not sure my question was relevant: after all, with or without open peer review, readers can still openly post what they think on Twitter, PubMed, or the journal site anyway.

An older audience member commented, "I'm fine with this because I'm more senior, but I'm worried for you guys," and proceeded to explain that perhaps open science should not be the highest priority for junior investigators given the current academic incentives. This is true, and Satterthwaite did describe the risks of doing open science: it is time-consuming, the analysis is completely open to criticism, and the incentives are misaligned. For example, in a recent presentation, Anna Hatch showed that researchers consider contributions to open science the least important factor in tenure-track hiring decisions. I think we all agree that doing open science is not without risks. However, I believe that, as academic culture changes (even though very slowly), the benefits will eventually outweigh the risks. Himmelstein mentioned a personal benefit in his recent talk at the 2019 Workshop for Penn's GCB PhD Students:

You may not be the most experienced coder, but the process of sharing & documenting the code & engaging others will, by the time you complete your PhD, almost certainly result in you having better code than if you didn't.

Doing open science takes time, but that does not mean we shouldn’t do it.

