Open Science and Research
The problems:
  - There is a growing consensus that scientific fraud is widespread.
 
  - Good science requires replicability. Understanding and replicating a result from a typical ed-tech study might require access to:
    
      - The same software environment the educational technology ran in.
 
      - The curricular materials used.
 
      - The data generated by the students.
 
    
   
  - It’s easy to do bad science by accident. For example, if you run 50 studies and 2 come out significant at p < 0.05, odds are those are false positives: even with no real effects at all, you would expect about 2.5 of the 50 to cross that threshold by chance.
 
  - Lab notebooks, which used to keep an archival record of the scientific process, have fallen out of fashion.
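
The arithmetic behind the 50-study example can be checked directly. This is a minimal sketch, assuming the 50 tests are independent and that no real effect exists in any of them (both are simplifying assumptions, not claims about any particular study). The variable names are illustrative only.

```python
from math import comb

alpha = 0.05      # significance threshold used in the example
n_studies = 50    # number of independent studies (assumed independent)

# Expected number of false positives if every null hypothesis is true.
expected_false_positives = alpha * n_studies

def prob_at_least(k, n, p):
    """Binomial tail: probability of k or more 'significant' results by chance."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_at_least_one = prob_at_least(1, n_studies, alpha)
p_at_least_two = prob_at_least(2, n_studies, alpha)

# One common (conservative) correction: Bonferroni divides alpha by the
# number of tests, so each study would need p below this to count.
bonferroni_threshold = alpha / n_studies

print(f"expected false positives: {expected_false_positives:.2f}")
print(f"P(at least 1 by chance):  {p_at_least_one:.3f}")
print(f"P(at least 2 by chance):  {p_at_least_two:.3f}")
print(f"Bonferroni threshold:     {bonferroni_threshold:.4f}")
```

Under these assumptions, two or more chance "hits" out of 50 is the likely outcome rather than the surprising one, which is why the number of analyses attempted needs to be on the record.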
 
The solution? Transparency! (And specifically, open science.)
For the rest of this session, you are an open-science activist. You believe:
  - Confirmatory research should be preregistered.
 
  - Confirmatory research should happen before a data pool is polluted by exploratory research.
 
  - All source code for the educational technology should be free and open-source software, so you can understand the context the research was done in and replicate it.
 
  - All curricular materials should be open, again, for replicability and understanding the study.
 
  - All analyses run as part of the study should not only be open-source (for replicability) but logged, to prevent p-value hunting.
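
One way the "logged" requirement might be met is an append-only, hash-chained record of every analysis run, so that quietly dropping or editing an earlier entry becomes detectable. The sketch below is an assumption about how such a log could work, not a description of any existing repository; `AnalysisLog`, `record`, and `verify` are hypothetical names.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    """SHA-256 over a canonical JSON encoding of an entry (without its own hash)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AnalysisLog:
    """Hypothetical append-only log: each entry commits to the one before it."""

    def __init__(self):
        self.entries = []

    def record(self, description, code, result):
        """Append one analysis run. `result` must be JSON-serializable."""
        body = {
            "index": len(self.entries),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "description": description,
            "code_sha256": hashlib.sha256(code.encode("utf-8")).hexdigest(),
            "result": result,
            "prev_hash": self.entries[-1]["hash"] if self.entries else GENESIS,
        }
        body["hash"] = _digest(body)
        self.entries.append(body)
        return body

    def verify(self):
        """Return True only if no entry was altered, removed, or reordered."""
        prev = GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


log = AnalysisLog()
log.record("t-test, treatment vs. control", "ttest(a, b)", {"p": 0.21})
log.record("same test, subgroup only", "ttest(a_sub, b_sub)", {"p": 0.04})
print(log.verify())
```

A log like this only shows that the recorded history is internally consistent. Preventing someone from rebuilding the whole chain would additionally require publishing or timestamping the latest hash somewhere outside the researcher's control.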
 
Talk through what you would expect from a repository. A few frictions:
  - You have a particular allergy to “right to be forgotten” requests, since loss of data prevents scientific replication.
 
  - Indeed, you believe research data should be widely shared.
 
  - Proprietary software prevents replicability (e.g. data from a Pearson ed-tech system).
 
  - Machine learning models can be hundreds of gigabytes, and can evolve (even continuously!)
    
      - Training data can be even bigger!
 
    
   
  - Educational data can include audio-video data, fine-grained click-stream data, etc.
 
  - It’s impossible to replicate many classical educational technology results from 1960-2000 since those computers no longer exist or work, and the programs are lost.
 
  - It’s impossible to replicate results based on web services from the 2000s which no longer exist (or even ones which have changed significantly, such as Khan Academy).
 
  - Data, without context and interpretability, is useless.
 
  - Discoverability. How do people find the data they need?
 
  - Related to the above, what other context is needed to find and make use of data?
 
Discuss what you’d want from a data repository, and how you’d build it out to support open science, research integrity, and research transparency.
A good, broader set of guidelines we’d like to follow is the IES SEER principles (Standards for Excellence in Education Research).