MIT Prize for Open Data
To highlight the value of open data at MIT, and to encourage the next generation of researchers, the MIT School of Science and the MIT Libraries present the MIT Prize for Open Data.
The following winners and honorable mentions were selected from more than 70 nominees representing 25 different departments, labs, centers, and institutes across MIT.
Winners
- Awad Abdelhalim, assistant director of research, Urban Mobility and Transit Labs, founder and co-director, the KhartouMap Initiative; Ilham Ali, founder and co-director, the KhartouMap Initiative; Abubakr Ziedan, head of Transit Data, the KhartouMap Initiative.
The KhartouMap Initiative
A Sudanese youth-led initiative catalyzing the modernization of Sudan’s public transit through mapping, open data, education, and innovation, KhartouMap is the first to fully map Khartoum’s semi-formal public transit system and provide open data on transit routes, usage, and opportunities for improvement.
- Faisal AlNasser, PhD candidate, Civil and Environmental Engineering, and Dara Entekhabi, Bacardi and Stockholm Water Foundations Professor, Civil and Environmental Engineering and Earth, Atmospheric, and Planetary Sciences
DustSCAN Dust Plumes Dataset
The first open-source collection tracking mineral dust plumes using satellite data across the global “Dust Belt,” the DustSCAN dataset has enabled significant advancements in dust research, facilitating the identification of dust sources and transport pathways, and fostering diverse scientific contributions.
- Samuel Baltz, research scientist, MIT Election Data & Science Lab; Aleksandra Conevska, PhD candidate in Government, Harvard University; Zachary Djanogly Garai, senior research associate, MIT Election Data & Science Lab; Shigeo Hirano, professor of Political Science, Columbia University; Kevin E. Acevedo Jetter, undergraduate student, MIT; Shiro Kuriwaki, assistant professor of Political Science, Yale University; Jeffrey B. Lewis, professor of Political Science, UCLA; Joseph R. Loffredo, PhD candidate in Political Science, MIT; Kate Murray, Master of Public Policy student, Duke University; Can E. Mutlu, PhD candidate in Government, Harvard University; Mason Reece, PhD candidate in Political Science, MIT; Taran Samarth, PhD student in Political Science, Yale University; James M. Snyder, Jr., Leroy B. Williams Professor of History and Political Science, Harvard University; Charles H. Stewart III, Kenan Sahin Distinguished Professor of Political Science, MIT
Cast vote records: A database of ballots from the 2020 U.S. Election
The team downloaded publicly available unstandardized cast vote records from the 2020 U.S. general election, standardized them into a multi-state database, and extensively compared their totals to certified election results. The release includes vote records for President, Governor, U.S. Senate and House, and state upper and lower chambers, covering 42.7 million voters in 20 states who voted for more than 2,200 candidates.
- Lily Chen, undergraduate student, Mathematics, EECS
FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence
FactPICO is a novel and open benchmark for factuality evaluation of plain language summarization of medical evidence including 345 LLM-generated summaries of randomized controlled trial abstracts, as well as fine-grained medical expert factuality assessments based on a PICO evaluation framework.
- Yifu Ding, postdoc, MIT Energy Initiative; Serena Patel, graduate researcher, MIT Energy Initiative; Dharik Mallapragada, assistant professor, Department of Chemical and Biomolecular Engineering, NYU; Robert Stoner, founding director, Tata Center for Technology and Design; Jansen Wong, undergraduate student; Guiyan Zang, research lead, MIT Energy Initiative
A Dataset of the Operating Station Heat Rate for 806 Indian Coal Plant Units using Machine Learning
Considering different factors, including water stress, coal price, coal age, and power capacity, the group created a station heat rate dataset for 806 Indian coal plant units using machine learning, presenting the most comprehensive coverage compared with previous databases.
- Mohamed Elrefaie, graduate student, Mechanical Engineering, MIT and Technical University of Munich; Faez Ahmed, d’Arbeloff Career Development Assistant Professor of Mechanical Engineering, MIT; Angela Dai, associate professor, Technical University of Munich; Florin Morar, engineer at BETA CAE Inc.
DrivAerNet: A Large-Scale, Multimodal, and High-Fidelity Dataset for Data-Driven Aerodynamic Design
DrivAerNet provides a comprehensive, large-scale multimodal car dataset with high-fidelity CFD simulations and deep learning benchmarks, enabling advanced aerodynamic analysis and design optimization.
- Hannah Jacobs, PhD candidate, Biology
Widespread naturally variable human exons aid genetic interpretation
Detecting naturally variable human exons in publicly available RNA sequencing data to aid in understanding of health and disease.
- Charlie Demurjian, lead data specialist, MIT BioMicro Center; Taisha Joseph, data specialist, MIT BioMicro Center; Stuart Levine, director, MIT BioMicro Center
Directing the Data Management and Analysis Core for the MIT Superfund Research Program
Creating infrastructure that handles thousands of datasets to enable effective sharing through open access.
- Joachim Schaeffer, visiting graduate student, MIT Energy Initiative, Technical University of Darmstadt
Lithium-Ion Battery System Field Data, Article
A large lithium-ion battery field dataset, which includes 133 million rows of data from 28 battery systems. This is the first openly available dataset of batteries that failed in the field and enables further research into battery health monitoring and fault detection, which is important for battery safety.
- Yosuke Tanigawa, research scientist, Computer Science & Artificial Intelligence Lab (CSAIL)
Inclusive Polygenic Score resources
Tanigawa developed inclusive polygenic scores (iPGS), the first methodology applicable to everyone across the continuum of genetic ancestry, for genetic prediction of disease risks. He released pre-trained iPGS models as open data (CC-BY-4.0) on the figshare repository and developed the iPGS browser, which works as a hub of organized data, facilitates the interpretation of PGS models, and streamlines their downstream applications.
Honorable Mentions
- Mohammed Alsobay
Empirica
An open-source virtual lab platform for interactive behavioral science experimentation.
- Hezekiah Branch
Cortical-Basal Ganglia Speech Networks (CBGSN)
The largest existing dataset of direct intracranial subcortical recordings of neural activity during speech.
- Charlie Cowen-Breen
Logion; paper
A dataset designed to accelerate the discovery of real errors in premodern texts using machine learning methods.
- Rachel Luu
BioinspiredLLM: Open Access Generative AI Tools and Methods for Bio-inspired Materials Research and Beyond
An open-source large language model fine-tuned on a unique dataset derived from full-text articles in biological and bio-inspired materials.
- Enrico Marchesini
Advancing Power Grid Operations with Open Data and Reinforcement Learning
RL2Grid is a framework that utilizes open grid data to model and benchmark decision-making in realistic power grid scenarios.
- Pat Pataranutaporn
Open Dance Lab: Digital Platform for Examining, Experimenting, and Evolving Intangible Cultural Heritage
This collaborative endeavor with Pichet Klunchun, a renowned contemporary choreographer, focuses on examining Southeast Asian dance traditions that have not previously been subject to digital or computational research.
- Edgar Ramirez Sanchez
NeuralMOVES
A faster and programmatic surrogate version of MOVES, the official emission model in the US.
- Vineeth Venugopal
MatKG: An autonomously generated knowledge graph in Material Science
As the largest knowledge graph in materials science to date, MatKG provides structured organization of domain-specific data. Its deployment holds promise for various applications, including material discovery, recommendation systems, and advanced analytics.
2024 Committee
Committee Co-Chairs
- Chris Bourg, Director, MIT Libraries
- Rebecca Saxe, Associate Dean of Science, School of Science (SoS)
Committee Members
- Iain Cheeseman, Herman and Margaret Sokol Professor of Biology, SoS and Whitehead
- Fotini Christia, Ford International Professor in the Social Sciences and Institute for Data, Systems, and Society
- Satrajit Ghosh, Director of the Open Data in Neuroscience Initiative, McGovern Institute, and Director of Data Models and Integration, ReproNim
- Rafael Jaramillo, Thomas Lord Career Development Professor, Associate Professor of Materials Science and Engineering
- Nick Lindsay, Director of Journals and Open Access, MIT Press
- Peace Ossom, Director of Research Data Services, MIT Libraries
- Jack Payette, graduate student, Earth and Planetary Sciences, SoS
- Tom Pollard, research scientist, Laboratory for Computational Physiology
- Virginia Spanoudaki, Scientific Director, Preclinical Imaging and Testing, Koch Institute
- Jerik Cruz, graduate student, Political Science
- Paul Berube, research scientist, Civil and Environmental Engineering
- Sue Kriegsman, deputy director, Center for Research on Equitable and Open Scholarship (CREOS), MIT Libraries
- Steve Flavell, Associate Professor, Picower Institute for Learning & Memory and Department of Brain and Cognitive Sciences
- Sadie Roosa, Collections Strategist for Repository Services, MIT Libraries
Questions? Email open-data-prize@mit.edu
Co-sponsored by the MIT School of Science and MIT Libraries