THE DATA
Media Credit: Center for Geospatial Information Sciences
About the Data – SEDA 2023 Version 5.0
Our dataset comes from The Educational Opportunity Project at Stanford University, a national database of academic performance data. Among its data files is SEDA 2023, the dataset for the 2019-2023 Education Recovery project. The resource includes data files, technical documentation, and data codebooks, all available to the public. The SEDA 2023 Version 5.0 release that we use contains technical documentation and a dataset of yearly pooled math and reading achievement scores at the administrative school district level, organized by state and subgroup (race/ethnicity and socioeconomic status).

Data Sources and Collection Method
SEDA 2023 is designed to provide data that researchers, policymakers, educators, and parents can use to address educational opportunities for all children. It is part of a larger partnership with Harvard University’s Center for Education Policy Research, aimed at analyzing school district average achievement in 2023, three years after the onset of the COVID-19 pandemic. The Bill & Melinda Gates Foundation provided a grant for the construction of SEDA 2023, with additional research support from the Carnegie Corporation of New York, Bloomberg Philanthropies, and Kenneth C. Griffin. The data on student achievement in each state were provided by the National Center for Education Statistics (NCES) and the National Assessment Governing Board.
The first step in creating this dataset is building a school-to-district crosswalk, which assigns each school to its operating administrative district. Next, the data were cleaned by removing cases with low participation rates, incomplete subgroup information, a Primary Testing Rate below 95%, or excessive use of alternate assessments. This step ensured that only data meeting quality standards were carried forward. After that, score distributions were estimated for each state, subject, grade, and year using Heteroskedastic Ordered Probit (HETOP) models. The prepared data were then used to estimate mean test scores at both the district and state levels with pooled statistical models. Two sets of estimates were produced: one using ordinary least squares (OLS) and another using Empirical Bayes (EB) methods. The final test score estimates were standardized relative to the 2019 national average (Year Standardized scale) and adjusted to reflect average per-grade growth (Grade Year Standardized scale). Finally, changes over time (pre- and post-pandemic) were calculated, and data that did not meet reliability thresholds were suppressed or flagged.
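The cleaning, standardization, and Empirical Bayes steps described above can be illustrated with a minimal sketch. This is not SEDA's actual procedure, which relies on HETOP and pooled models documented in its technical documentation. The field names (`participation_rate`, `subgroups_complete`), the function names, and the choice to divide by a reference standard deviation are all assumptions made for illustration, and the records are made-up placeholders.

```python
def passes_quality_checks(record, min_participation=0.95):
    """Hypothetical filter: keep a record only if its participation rate
    meets the threshold and its subgroup information is complete."""
    return (
        record["participation_rate"] >= min_participation
        and record["subgroups_complete"]
    )


def year_standardize(score, ref_mean, ref_sd):
    """Express a score relative to a reference (e.g. 2019 national) mean,
    in units of the reference standard deviation (an assumption here)."""
    return (score - ref_mean) / ref_sd


def eb_shrink(ols_estimate, std_error, prior_mean, prior_var):
    """Textbook Empirical Bayes shrinkage: pull a noisy estimate toward the
    prior mean in proportion to how unreliable the estimate is."""
    reliability = prior_var / (prior_var + std_error ** 2)
    return prior_mean + reliability * (ols_estimate - prior_mean)


# Placeholder records, not real SEDA values.
records = [
    {"district": "A", "participation_rate": 0.98, "subgroups_complete": True},
    {"district": "B", "participation_rate": 0.90, "subgroups_complete": True},
]
kept = [r["district"] for r in records if passes_quality_checks(r)]
```

In this sketch, an estimate with a large standard error is pulled strongly toward the prior mean, while a precise estimate stays close to its OLS value, which is the general reason EB estimates are often preferred for small districts.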
Application of Data
Our dataset lets us compare pre-pandemic and post-pandemic academic performance in two states that implemented different COVID shutdown and masking policies, California and Ohio, because it includes data from 2019 (pre-pandemic) and from 2022 and 2023 (two to three years after the onset of the pandemic). Because we have test score estimates from both 2022 and 2023, we can gain insight into academic recovery since schools began returning to fully in-person learning. The two primary sources for SEDA 2023 are EDFacts, which collects aggregated test scores in math and reading language arts for grades 3-8 from each state’s standardized testing program, and state-reported accountability data, which are publicly reported district proficiency data. Both fall under federal requirements: EDFacts reporting is required by federal law, and state-reported scores are published under federal accountability rules.
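As a rough sketch of the comparison described here, the change in a state's average score relative to the 2019 baseline could be computed as follows. The row layout (`state`, `subject`, `year`, `score`) and the function name are hypothetical, and the scores are placeholder values chosen for illustration, not actual SEDA estimates.

```python
def change_since_baseline(rows, baseline_year=2019):
    """Return {(state, subject, year): score minus that state's and
    subject's baseline-year score} for every non-baseline row."""
    baseline = {
        (r["state"], r["subject"]): r["score"]
        for r in rows
        if r["year"] == baseline_year
    }
    changes = {}
    for r in rows:
        key = (r["state"], r["subject"])
        if r["year"] != baseline_year and key in baseline:
            changes[(r["state"], r["subject"], r["year"])] = r["score"] - baseline[key]
    return changes


# Placeholder scores on a standardized scale, not real SEDA values.
rows = [
    {"state": "CA", "subject": "math", "year": 2019, "score": 0.0},
    {"state": "CA", "subject": "math", "year": 2022, "score": -0.25},
    {"state": "CA", "subject": "math", "year": 2023, "score": -0.125},
    {"state": "OH", "subject": "math", "year": 2019, "score": 0.5},
    {"state": "OH", "subject": "math", "year": 2023, "score": 0.25},
]
changes = change_since_baseline(rows)
```

A negative change means scores remain below the 2019 level. Comparing the 2022 and 2023 changes for the same state gives a simple indication of whether recovery is under way.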
Limitations and Exclusions
It is important to note that some information is left out of the SEDA 2023 dataset. The dataset, which focuses on educational achievement, lacks individual-level details such as socioeconomic status and specific educational needs because of privacy concerns. It also omits data on school ratings, teacher qualifications, and other qualitative factors, as well as external elements such as community support and parental involvement. The absence of this information restricts deeper analysis of how local policies and the COVID-19 pandemic have affected academic performance.
Critically, the dataset does not include high school data, only covering grades 3-8. This omission prevents analysis of how the pandemic affected students in high school, who face unique challenges such as college entrance exam preparation and significant life decisions. The lack of high school data limits understanding of changes in college readiness and the effectiveness of remote learning for older students.
Lastly, the dataset primarily relies on aggregated test score data, which does not track individual student progress over time and is confined to binary representations of race and gender, alongside socioeconomic status aggregated at the district level. This focus on test scores as the primary measure of success overlooks broader educational and developmental factors, potentially leading to a narrowed understanding of educational success.
Exploratory Data Analysis (EDA)
In our project, we chose to focus on California and Ohio. To understand the composition of our data before developing visualizations, we performed exploratory data analysis, examining the number of scores collected and their distribution between English and math.

This bar graph displays the number of students tested in California (CA) and Ohio (OH) in 2023. California has more students tested overall than Ohio, but both states have a substantial amount of data.

The two standardized test subjects collected were English language arts and math. This visualization shows the distribution of subject scores collected in California and Ohio. California’s collected scores are almost evenly distributed, with 51.9% in math and the remaining 48.1% in English. The distribution in Ohio is less even, with 66.1% of collected scores in English and the remaining 33.9% in math. Our goal is to be conscious of our data’s features and their distributions before proceeding with further analysis.
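The subject shares reported above are percentages of score records per state. A minimal sketch of that calculation is shown below. The row layout (`state`, `subject`) and the function name are hypothetical, and the rows are placeholders rather than the actual SEDA records.

```python
from collections import Counter, defaultdict


def subject_shares(rows):
    """Return {state: {subject: percent of that state's score records}}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["state"]][row["subject"]] += 1
    return {
        state: {
            subject: 100 * n / sum(by_subject.values())
            for subject, n in by_subject.items()
        }
        for state, by_subject in counts.items()
    }


# Placeholder rows, not real SEDA records.
rows = [
    {"state": "CA", "subject": "math"},
    {"state": "CA", "subject": "math"},
    {"state": "CA", "subject": "english"},
    {"state": "CA", "subject": "english"},
    {"state": "OH", "subject": "english"},
    {"state": "OH", "subject": "english"},
    {"state": "OH", "subject": "english"},
    {"state": "OH", "subject": "math"},
]
shares = subject_shares(rows)
```

Computing shares within each state, rather than across the combined data, keeps the larger state from dominating the comparison.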