Notes: Reproducible Science IG Call Aug 17


Please reach out to jon@numfocus.org if you would like to have 20 minutes to discuss a project in a future interest group call!

Attending

  • Jonathan Starr (NumFOCUS)
    • Program manager for OSSci at NumFOCUS.
    • Part of a startup building software to address the reproduction and replication crisis.
  • Tim Bonnemann
    • Based in San Jose, California, but currently in Germany.
    • Community Lead, Open Source Science (OSSci)
    • Works at IBM Research.
  • Ian Buckley
    • Community and partnerships lead at Agnostic, a startup in Toronto, Canada.
    • Agnostic is behind Covalent, an open-source workflow orchestration platform for HPC (High-Performance Computing) and quantum computing.
    • Covalent is quantum-ready and can be used for any high-performance computing workflow orchestration, including simulations, optimizations, and ML/AI models.
  • Mike Croucher
    • Works at MathWorks.
    • Spent 20 years in academia before that.
    • Was part of the research software engineering movement, which focused on improving computing practices in research groups.
    • Joined MathWorks 3 years ago with an interest in interfacing MathWorks with open source.
  • Travis Wrightsman
    • Graduate student at Cornell University in Plant Breeding and Genetics.
    • Works on machine learning models to predict plant cell traits from DNA sequences.
    • His research tries to decipher how DNA regions near a gene can influence the gene’s expression.
    • Aims to apply these findings in an agricultural context to improve crop breeding.
  • Chris Erdmann
    • Associate Director for Open Science at Michael J. Fox Foundation.
    • Has a long-standing history in the Open Science and software community.
    • His role involves ensuring reproducibility and quality of research funded by the Michael J. Fox Foundation.
    • Working on a strategy for the organization’s approach to open science and reproducibility.
  • Alexy
    • Director of OSSci.

Major Discussion Points:

Introduction to the Reproducibility Topic:

  • Highlighted the significance of reproducibility in the scientific community.
  • Discussed the sociotechnical challenges tied to defining tools and standards for reproducibility.
  • Introduced reproducibility as a two-sided market, focusing on the differing needs based on the reproducer’s perspective.
  • Requested viewpoints on the significance of reproducibility across various sectors.

Reproducibility in Machine Learning:

  • Addressed the obstacles when non-experts try to reproduce machine learning models.
  • Underlined the value of code accessibility and its adaptability to diverse applications.
  • Noted the distinction between model sharing and genuinely achieving reproducible results.
  • Introduced a project, “MLC@Home”, aiming to close the gap between theoretical models and actual outcomes.

Sharing Models and Best Practices:

  • Emphasized the importance of information quality when disseminating models.
  • Pointed out the current inadequate model sharing practices in specific domains.
  • Presented the Open Modeling Foundation and its objectives.
  • Described collaborative efforts with platforms to set standards without resorting to proprietary tools.
  • Highlighted the importance of sharing and collaboration, particularly in Parkinson’s research.

Reproducibility’s Broad Impact and Ecosystem:

  • Defined reproducibility as ensuring data and code are accessible and usable.
  • Discussed the varied forms of reproducibility depending on user needs.
  • Stressed the need for feedback from varied stakeholders: funders, researchers, and institutions.
  • Discussed identifying and collaborating with organizations focused on reproducibility.
  • Explored the intricate relationship between model sharing and reproducibility, especially in the realm of machine learning.

Reproducibility in Industrial Settings & Basic Measures:

  • Pointed out the risks of using non-reproducible machine learning models in industries.
  • Reflected on foundational steps towards reproducibility, like sharing code and data, and on the origin and achievements of the research software engineering movement.

Metrics and Levels of Reproduction:

  • Discussed the scientific community’s progress in ensuring basic reproducibility.
  • Proposed the establishment of metrics for reproducibility across varied disciplines.
  • Discussed the complexities and categories in determining reproducibility.

Case Study & Open Science Indicators:

  • Introduced the “Aligning Science Across Parkinson’s” initiative as a valuable case study.
  • Discussed the use of expansive open science indicators to track reproducibility progress.

Complex Model Sharing:

  • Explored challenges in sharing intricate models and workflows, highlighting solutions like “Covalent”.

Role of MathWorks & Persistence Concerns:

  • Presented MathWorks’ commitment to aiding the research community and inquired about possible improvements.
  • Brought up the feature that allows MATLAB code to be opened directly from GitHub.
  • Tackled concerns about long-term accessibility and persistence of shared links and resources.

Collaboration, Education, and Cultural Shift:

  • Highlighted the value of education and feedback in advancing reproducibility.
  • Emphasized the necessity of incorporating reproducibility into the research culture and proposed indicators to signal research without associated software or data.

Resources and Links Shared During Discussion

Action items


Enjoyed being part of the call and wish I could have stayed for the whole thing!

The part on complex model sharing really resonates with me as a machine learning practitioner, because I commonly find myself spending a lot of time moving back and forth between flexible code I can iterate on rapidly and scalable code that can test lots of models quickly. Covalent looks like a really cool solution to this; I’ll have to try it out.
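
For context, here’s roughly the pattern I have in mind, sketched against Covalent’s decorator API as I understand it from the docs (I haven’t run this yet, so treat it as a sketch rather than a working example; the function bodies are placeholders):

```python
import covalent as ct

# Each electron stays a plain Python function I can iterate on locally.
@ct.electron
def load_data(path):
    # placeholder: load and preprocess a dataset
    return [1.0, 2.0, 3.0]

@ct.electron
def train_model(data, learning_rate):
    # placeholder: fit a model; here just a dummy dict standing in for one
    return {"lr": learning_rate, "n": len(data)}

@ct.electron
def evaluate(model, data):
    # placeholder: compute some metric for the trained model
    return model["lr"] * len(data)

# The lattice stitches the electrons into a workflow that can be dispatched
# to an HPC or cloud backend without rewriting the functions above.
@ct.lattice
def sweep(path, learning_rate):
    data = load_data(path)
    model = train_model(data, learning_rate)
    return evaluate(model, data)

# Dispatch a few hyperparameter settings; each run gets its own dispatch ID
# that Covalent tracks (requires the Covalent server to be running).
for lr in (1e-2, 1e-3, 1e-4):
    print(ct.dispatch(sweep)("data.csv", lr))
```

The appeal to me is that the flexible, iterate-quickly version and the scaled-out version are the same functions, just with decorators on top.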

In terms of reproducibility, it would be nice as a researcher to have some sort of score attached to a paper that publishes a model I’d like to try out, evaluating how easy it will be for me to modify it for my problem. Maybe something like an Altmetric, but for model accessibility. Maybe the Open Modeling Foundation already has guidelines for this?
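
To make that concrete, here’s the sort of toy rubric I’m picturing (the criteria and weights are entirely made up for illustration; a real standard would presumably come from somewhere like the Open Modeling Foundation):

```python
# Hypothetical "model accessibility" rubric for a paper. Every criterion
# and weight below is invented for illustration only.
CRITERIA = {
    "code_public": 0.25,             # code repository is openly available
    "data_public": 0.20,             # training/evaluation data are shared
    "environment_pinned": 0.20,      # container image, lockfile, etc.
    "reuse_docs": 0.20,              # instructions for adapting the model
    "license_permits_reuse": 0.15,   # license allows modification and reuse
}

def accessibility_score(paper: dict) -> float:
    """Weighted fraction of reuse criteria the paper satisfies (0 to 1)."""
    return sum(weight for key, weight in CRITERIA.items() if paper.get(key))

example = {
    "code_public": True,
    "data_public": False,
    "environment_pinned": True,
    "reuse_docs": True,
    "license_permits_reuse": True,
}
print(f"Accessibility score: {accessibility_score(example):.2f}")  # 0.80
```

A single number like this obviously flattens a lot, which is why per-criterion breakdowns would probably be more useful than the score alone.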


That’s a great question! @christopher.c.erdman might know more on whether OMF has existing guidelines.

I’d love to see multiple scores developed that use a variety of metrics, each satisfying a viewer’s needs in different contexts. I think a lot of the trouble with reproducibility, and with “scoring” science in general, is that reality is so much more complex than a single score. And as was brought up during the call (sad you had to drop out early!), reproducibility means different things to different “users” of science based on, among other things, the context behind their interaction with the research or knowledge. There has to be variety and choice, which would mean developing standards and tools that “score” or “impact” developers can use to experiment with.

And for Covalent, I think Ian or someone else from the team might give a presentation on it in a couple of calls, so keep an eye out!

RE: OMF, best practices/guidelines are in development. But speaking to the score comment, we showed the radar chart that is part of our DataSeer reports regarding the outputs in a paper. It is a visual at least, but otherwise we can look at indicators.
