The third annual MIT Prize for Open Data, which included a $2,500 cash prize, was recently awarded to 10 individual and group research projects. Presented jointly by the School of Science and the MIT Libraries, the prize highlights the value of open data — research data that is openly accessible and reusable — at the Institute. The prize winners and 9 honorable mention recipients were honored at the Open Data @ MIT event held Oct. 22 at Hayden Library.
The prize program was launched in 2022, spearheaded by Chris Bourg, director of MIT Libraries, and Rebecca Saxe, associate dean of the School of Science and the John W. Jarve (1978) Professor of Brain and Cognitive Sciences. It recognizes MIT-affiliated researchers who use or share open data, create infrastructure for open data sharing, or theorize about open data. Nominations were solicited from across the Institute, with a focus on trainees: undergraduate and graduate students, postdocs, and research staff.
“This year, we noted a number of submissions touching on strategic priorities for MIT, like artificial intelligence,” said Bourg. “President Sally Kornbluth has spoken repeatedly about MIT’s special responsibility to contain its risks and harness its power for good. One way to do that is to focus on improving the openness and transparency of the data that underpins AI models and tools, and we were pleased to see so many projects focused on that critical issue. There were also a number of projects that relate in some way to climate change, democracy, and human health.”
The 2024 awards were presented at a celebratory event held during International Open Access Week. Winners gave five-minute presentations on their projects and the role that open data plays in their research. The program also included remarks from Bourg and Nergis Mavalvala, dean of the School of Science and Curtis (1963) and Kathleen Marble Professor of Astrophysics. Mavalvala noted that in her own field, the detection of gravitational waves, “instruments collect terabytes of data in the blink of an eye,” and open data drives the research forward. “People who make data usable by others are not celebrated enough,” she said.
Winners were chosen from more than 70 nominees, representing 25 different departments, labs, and centers across the Institute. A committee composed of faculty, staff, and graduate students made the selections:
- Awad Abdelhalim, assistant director of research, Urban Mobility and Transit Labs, won for the KhartouMap Initiative, along with collaborators Ilham Ali and Abubakr Ziedan. A Sudanese youth-led initiative catalyzing the modernization of Sudan’s public transit through mapping, open data, education, and innovation, KhartouMap is the first to fully map Khartoum’s semi-formal public transit system and provide open data on transit routes, usage, and opportunities for improvement.
- Faisal AlNasser, PhD candidate in Civil and Environmental Engineering (CEE), and Dara Entekhabi, Bacardi and Stockholm Water Foundations Professor, were recognized for the DustSCAN Dust Plumes Dataset, the first open-source collection tracking mineral dust plumes using satellite data across the global “Dust Belt.” The DustSCAN dataset has enabled significant advancements in dust research, facilitating the identification of dust sources and transport pathways, and fostering diverse scientific contributions.
- Mason Reece, PhD candidate in Political Science, presented on behalf of the team behind Cast vote records: A database of ballots from the 2020 U.S. Election. The team downloaded publicly available unstandardized cast vote records from the 2020 U.S. general election, standardized them into a multi-state database, and extensively compared their totals to certified election results. The release includes vote records for President, Governor, U.S. Senate and House, and state upper and lower chambers, covering 42.7 million voters in 20 states who voted for more than 2,200 candidates. The team also included Samuel Baltz, research scientist, MIT Election Data & Science Lab; Zachary Djanogly Garai, senior research associate, MIT Election Data & Science Lab; Kevin E. Acevedo Jetter, undergraduate student; Joseph R. Loffredo, PhD candidate in Political Science; and Charles H. Stewart III, Kenan Sahin Distinguished Professor of Political Science, as well as collaborators from Harvard, Yale, Columbia, and Duke.
- Lily Chen, an undergraduate student, won for FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence. FactPICO is a novel, open benchmark for evaluating the factuality of plain language summaries of medical evidence. It includes 345 LLM-generated summaries of randomized controlled trial abstracts, as well as fine-grained factuality assessments by medical experts based on a PICO evaluation framework.
- Serena Patel, graduate researcher, MIT Energy Initiative, presented a dataset of the Operating Station Heat Rate for 806 Indian Coal Plant Units Using Machine Learning. The team also included Yifu Ding, postdoc, MIT Energy Initiative; Jansen Wong, undergraduate student; Guiyan Zang, research lead, MIT Energy Initiative; Robert Stoner, founding director, Tata Center for Technology and Design; and Dharik Mallapragada, assistant professor, NYU. Drawing on factors including water stress, coal price, coal age, and power capacity, the group used machine learning to create a station heat rate dataset for 806 Indian coal plant units, offering more comprehensive coverage than previous databases.
- Mohamed Elrefaie, graduate student, Mechanical Engineering, MIT and Technical University of Munich; Faez Ahmed, d’Arbeloff Career Development Assistant Professor of Mechanical Engineering, MIT; Angela Dai, associate professor, Technical University of Munich; and Florin Morar, engineer at BETA CAE Inc., won for DrivAerNet: A Large-Scale, Multimodal, and High-Fidelity Dataset for Data-Driven Aerodynamic Design. DrivAerNet provides a comprehensive, large-scale multimodal car dataset with high-fidelity CFD simulations and deep learning benchmarks, enabling advanced aerodynamic analysis and design optimization.
- Hannah Jacobs, PhD candidate in Biology, won for her project, Widespread naturally variable human exons aid genetic interpretation, which detects naturally variable human exons in publicly available RNA sequencing data to aid the understanding of health and disease.
- Charlie Demurjian, lead data specialist, MIT BioMicro Center; Taisha Joseph, data specialist, MIT BioMicro Center; and Stuart Levine, director, MIT BioMicro Center, were recognized for the Data Management and Analysis Core for the MIT Superfund Research Program. They created infrastructure that handles thousands of datasets to enable effective sharing through open access.
- Joachim Schaeffer, a visiting graduate student at the MIT Energy Initiative and the Technical University of Darmstadt, won for a large lithium-ion battery field dataset, which includes 133 million rows of data from 28 battery systems. This is the first openly available dataset of batteries that failed in the field and enables further research into battery health monitoring and fault detection, which is important for battery safety.
- Yosuke Tanigawa, a research scientist in the Computer Science & Artificial Intelligence Lab, developed inclusive polygenic scores (iPGS), the first methodology applicable to everyone across the continuum of genetic ancestry, for genetic prediction of disease risks. He released pre-trained iPGS models as open data (CC-BY-4.0) on the figshare repository and developed the iPGS browser, which serves as a hub of organized data, facilitates the interpretation of PGS models, and streamlines their downstream applications.
A complete list of winning projects and honorable mentions, including links to the research data, is available on the MIT Libraries’ Open Data website.