Open Science and Research
The problems:
- There is a growing consensus that scientific fraud is widespread.
- Good science requires replicability. Understanding and replicating a result from a typical ed-tech study might require access to:
  - The same software environment the educational technology ran in.
  - The curricular materials used.
  - The data generated by the students.
- It’s easy to do bad science by accident. For example, if you run 50 studies and 2 come out significant at p < 0.05, odds are both came out positive by chance: with no real effects at all, you would expect about 50 × 0.05 = 2.5 false positives.
- Lab notebooks, which used to keep an archival record of the scientific process, have fallen out of fashion.
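The arithmetic behind the 50-studies example can be sketched as follows. This is a minimal illustration, assuming every study tests a true null hypothesis and that the studies are independent; the function name `prob_at_least` is made up for this sketch.

```python
from math import comb


def prob_at_least(k: int, n: int, alpha: float) -> float:
    """Probability of at least k false positives among n independent
    tests at threshold alpha, assuming every null hypothesis is true."""
    # Binomial probability of seeing fewer than k false positives.
    p_fewer = sum(
        comb(n, i) * alpha**i * (1 - alpha) ** (n - i) for i in range(k)
    )
    return 1.0 - p_fewer


n_studies = 50
alpha = 0.05

# Expected number of significant results produced by chance alone.
expected_false_positives = n_studies * alpha
# Chance of seeing two or more "hits" even though nothing is real.
p_two_or_more = prob_at_least(2, n_studies, alpha)

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least 2 by chance):  {p_two_or_more:.2f}")
```

Under these assumptions, two or more spurious hits turn up in roughly 72% of such 50-study batches, which is why two significant results out of 50 are weak evidence on their own.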
The solution? Transparency! (and specifically open science)
For the rest of this session, you are an open-science activist. You believe:
- Confirmatory research should be preregistered.
- Confirmatory research should happen before a data pool is polluted by exploratory research.
- All source code for the educational technology should be free and open source software, so you can understand the context the research was done in and replicate it.
- All curricular materials should be open, again, for replicability and understanding the study.
- All analyses run as part of the study should not only be open source (for replicability) but also logged, to prevent p-value hunting.
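One way to picture a logged analysis trail is an append-only, hash-chained log, where each entry commits to the one before it, so quietly editing or dropping an earlier analysis becomes detectable. The sketch below is an illustration only, not a prescribed design; the names `append_entry` and `verify` and the entry fields are assumptions made for this example.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of an entry body."""
    return hashlib.sha256(
        json.dumps(body, sort_keys=True).encode("utf-8")
    ).hexdigest()


def append_entry(log: list, analysis: str, params: dict) -> dict:
    """Append one analysis record, chained to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {
        "index": len(log),
        "analysis": analysis,
        "params": params,
        "prev_hash": prev_hash,
    }
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Return True if no entry was edited, reordered, or removed mid-chain."""
    prev_hash = "0" * 64
    for i, entry in enumerate(log):
        body = {k: v for k, v in entry.items() if k != "hash"}
        if (
            entry["index"] != i
            or entry["prev_hash"] != prev_hash
            or entry["hash"] != _digest(body)
        ):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical analyses, logged in the order they were run.
log = []
append_entry(log, "t-test: treatment vs. control on post-test", {"alpha": 0.05})
append_entry(log, "same t-test, restricted to one grade level", {"alpha": 0.05})
print(verify(log))
```

A chain like this does not by itself reveal entries trimmed from the end of the log; publishing the latest hash somewhere public (for example, alongside the preregistration) would be one way to cover that gap.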
Talk through what you would expect from a repository. A few frictions:
- You have a particular allergy to “right to be forgotten” requests, since loss of data prevents scientific replication.
- Indeed, you believe research data should be widely shared.
- Proprietary software prevents replicability (e.g., data from a Pearson ed-tech system).
- Machine learning models can be hundreds of gigabytes, and can evolve (even continuously!)
- Training data can be even bigger!
- Educational data can include audio-video data, fine-grained click-stream data, etc.
- It’s impossible to replicate many classical educational technology results from 1960-2000 since those computers no longer exist or work, and the programs are lost.
- It’s impossible to replicate results based on web services from the 2000s which no longer exist (or even ones which changed significantly, such as Khan Academy).
- Data, without context and interpretability, is useless.
- Discoverability. How do people find the data they need?
- Related to the above, what other context is needed to find and make use of data?
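To make the context and discoverability points concrete, here is a rough sketch of a metadata record that travels with each dataset, plus a naive keyword search over such records. Every field name, the `DatasetRecord` class, the `search` function, and the sample values are hypothetical, chosen only to illustrate the kind of context a repository might capture.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Context a researcher would need to find and interpret a dataset."""
    title: str
    description: str
    software_version: str  # e.g. a commit hash of the ed-tech platform
    curriculum_ref: str    # pointer to the curricular materials used
    license: str
    keywords: List[str] = field(default_factory=list)


def search(records: List[DatasetRecord], term: str) -> List[DatasetRecord]:
    """Case-insensitive substring match on title, description, and keywords."""
    term = term.lower()
    return [
        r
        for r in records
        if term in r.title.lower()
        or term in r.description.lower()
        or any(term in k.lower() for k in r.keywords)
    ]


# Made-up catalog entries, for illustration only.
catalog = [
    DatasetRecord(
        title="Fractions tutor click-stream",
        description="Fine-grained click-stream logs from a fractions unit.",
        software_version="commit-abc123",
        curriculum_ref="fractions-unit-v2",
        license="CC-BY-4.0",
        keywords=["math", "click-stream"],
    ),
    DatasetRecord(
        title="Essay revision histories",
        description="Keystroke-level revision data from a writing course.",
        software_version="commit-def456",
        curriculum_ref="writing-course-v1",
        license="CC-BY-4.0",
        keywords=["writing", "process data"],
    ),
]

matches = search(catalog, "click-stream")
print([r.title for r in matches])
```

A real repository would likely want a richer schema and proper indexing; the point is only that the software version, curriculum, and license are recorded next to the data rather than left to memory.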
Discuss what you’d want from a data repository, and how you’d build it out to support open science, research integrity, and research transparency.
A good, broader set of guidelines we’d like to follow is the IES Standards for Excellence in Education Research (SEER).