Open Science and Research

The problems:

There is a growing consensus that scientific fraud is widespread.
Good science requires replicability. To understand and replicate a result from a typical ed-tech study might require access to:
- The same software environment the educational technology ran in.
- The curricular materials used.
- The data generated by the students.
It’s easy to do bad science by accident. For example, if you run 50 studies and find 2 results significant at p < 0.05, odds are those came out positive by chance: with no real effects at all, you would still expect about 50 × 0.05 = 2.5 false positives.
Lab notebooks, which used to keep an archival record of the scientific process, have fallen out of fashion.
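
The arithmetic behind the 50-studies example above can be sketched as follows. This is a minimal illustration, assuming the 50 tests are independent, that no real effect exists in any of them, and that each uses a 0.05 significance threshold. The function names are hypothetical.

```python
from math import comb


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected count of 'significant' results when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least(k: int, n_tests: int, alpha: float) -> float:
    """P(at least k false positives), modelling independent tests as a binomial."""
    p_fewer_than_k = sum(
        comb(n_tests, i) * alpha**i * (1 - alpha) ** (n_tests - i)
        for i in range(k)
    )
    return 1 - p_fewer_than_k


print(expected_false_positives(50, 0.05))      # 2.5
print(round(prob_at_least(2, 50, 0.05), 2))    # 0.72
```

Under these assumptions, seeing 2 or more significant results out of 50 is the likely outcome even when nothing real is going on, which is why the count of analyses run matters as much as the p-values reported.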

The solution? Transparency! (And specifically, open science.)

For the rest of this session, you are an open-science activist. You believe:

Confirmatory research should be preregistered.
Confirmatory research should happen before a data pool is polluted by exploratory research.
All source code for the educational technology should be free and open source software, so you can understand the context the research was done in and replicate it.
All curricular materials should be open, again, for replicability and understanding the study.
All analyses run as part of the study should not only be open-source (for replicability) but logged, to prevent p-value hunting.
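
One way the "logged" requirement above could work is an append-only log in which each entry's hash covers the previous entry, so quietly editing or dropping an earlier analysis becomes detectable. This is a sketch under stated assumptions, not a description of any existing system: the `AnalysisLog` class and its field names are hypothetical, and the recorded analyses and p-values are made-up placeholders.

```python
import hashlib
import json
import time


class AnalysisLog:
    """Append-only log; each entry's hash covers the previous entry's hash."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(entry: dict) -> str:
        # Hash every field except the stored hash itself.
        payload = {k: v for k, v in entry.items() if k != "hash"}
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def record(self, description: str, p_value: float) -> dict:
        entry = {
            "index": len(self.entries),
            "timestamp": time.time(),
            "description": description,
            "p_value": p_value,
            "prev_hash": self.entries[-1]["hash"] if self.entries else self.GENESIS,
        }
        entry["hash"] = self._digest(entry)
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Return True only if no entry was altered and the chain is unbroken."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            if entry["prev_hash"] != prev_hash or entry["hash"] != self._digest(entry):
                return False
            prev_hash = entry["hash"]
        return True


log = AnalysisLog()
log.record("t-test: treatment vs. control on post-test score", 0.21)
log.record("same test, restricted to one grade level", 0.04)
print(log.verify())                # True
log.entries[0]["p_value"] = 0.01   # tamper with an earlier result
print(log.verify())                # False
```

A chain like this only shows that the recorded history is internally consistent. Detecting a dropped final entry, or a log rewritten from scratch, would additionally require publishing the latest hash somewhere outside the researcher's control, for example alongside the preregistration.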

Talk through what you would expect from a repository. A few frictions:

You have a particular allergy to “right to be forgotten” requests since loss of data prevents scientific replication.
Indeed, you believe research data should be widely shared.
Proprietary software prevents replicability (e.g., data from a Pearson ed-tech system).
Machine learning models can be hundreds of gigabytes, and can evolve (even continuously!)
- Training data can be even bigger!
Educational data can include audio-video data, fine-grained click-stream data, etc.
It’s impossible to replicate many classical educational technology results from 1960-2000 since those computers no longer exist or work, and the programs are lost.
It’s impossible to replicate results based on web services from the 2000s which no longer exist (or even ones which changed significantly, such as Khan Academy).
Data, without context and interpretability, is useless.
Discoverability. How do people find the data they need?
Related to the above, what other context is needed to find and make use of data?
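
As a starting point for discussion, the context and discoverability frictions above could be addressed with a small metadata record stored next to each dataset. The sketch below is only an illustration: every class, field, and function name is hypothetical, the example record is a made-up placeholder, and a real repository would likely adopt an established metadata standard. It assumes Python 3.9 or later.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    path: str
    size_bytes: int
    sha256: str  # content hash pins the exact version, even for very large files


@dataclass
class DatasetRecord:
    title: str
    description: str
    keywords: list[str]
    software_version: str   # which build of the ed-tech system produced the data
    curriculum_ref: str     # pointer to the curricular materials used
    files: list[FileEntry] = field(default_factory=list)


def describe_bytes(path: str, data: bytes) -> FileEntry:
    """Build a file entry from in-memory bytes (a real tool would stream from disk)."""
    return FileEntry(path, len(data), hashlib.sha256(data).hexdigest())


def search(records: list[DatasetRecord], term: str) -> list[DatasetRecord]:
    """Naive case-insensitive match over title, description, and keywords."""
    term = term.lower()
    return [
        r for r in records
        if term in r.title.lower()
        or term in r.description.lower()
        or any(term == k.lower() for k in r.keywords)
    ]


# Placeholder record for illustration only.
record = DatasetRecord(
    title="Example click-stream export",
    description="Per-student event log from a hypothetical algebra tutor.",
    keywords=["click-stream", "algebra"],
    software_version="tutor build abc123 (hypothetical)",
    curriculum_ref="unit-3-linear-equations (hypothetical)",
    files=[describe_bytes("events.csv", b"student,event\n1,hint\n")],
)
print([r.title for r in search([record], "algebra")])
```

Recording the software version and curriculum reference alongside the data is what ties a dataset back to the context it was generated in. The content hashes let a model or training set be cited precisely even when it keeps evolving.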

Discuss what you’d want from a data repository, and how you’d build it out to support open science, research integrity, and research transparency.

A good, broader set of guidelines we’d like to follow is the IES SEER (Standards for Excellence in Education Research) principles.