Scouring Millions of Papers to Create New Materials

Libraries help make articles accessible for text and data mining

Elsa Olivetti. Photo: Travis Gray.

As a scientist interested in developing sustainable methods to create materials, Elsa Olivetti wanted to data mine decades of scholarly articles to provide researchers with what she calls “a toolkit of how materials have been made, to learn how we can improve how they are made or make new materials.”

Olivetti, MIT’s Atlantic Richfield Assistant Professor of Energy Studies, and her collaborators built an artificial intelligence system that uses natural language processing to extract just the information on materials creation they needed from the literature, and that can scan far more articles than a human could. The catch? Many of the journals they were interested in were not available for data mining. That is when Olivetti started working with Ellen Finnie, head of Scholarly Communication and Collections Strategy for MIT Libraries.

“MIT researchers have been interested in text and data mining for decades,” Finnie said. “But academic journals and major newspapers are often behind paywalls or available only in formats that prohibit large-scale analysis.”

The Libraries purchase access to journals and other materials for MIT through license agreements, and Finnie and her colleagues have been negotiating for text and data mining rights in these licenses. “It is still such a new approach for publishers to allow this kind of access that the Libraries are working on standard language for text mining in the legal agreements with publishers,” she said.

To support Olivetti’s research, Finnie and her team, including Scholarly Communications and Licensing Librarian Katie Zimmerman, worked with each publisher individually to gain access to a wide swath of research in materials science. As a result, Olivetti and her colleagues have been able to analyze more than 1.5 million articles. To date they have published three papers on this work in the journals Chemistry of Materials, npj Computational Materials, and Scientific Data.

Providing researchers with access to minable texts and data is a growing need, Finnie said. While her team does much of the work on licensing, other library staff take the actual data, which often arrives from a publisher in a large text file with no sortable markers, and put it into a format that allows researchers to assess what is there and decide how to approach it. The barrier that Olivetti faced is one reason the Libraries are working to advance more open models of scholarly journal publication. “There are lots of different ways to move the environment towards more access and openness,” Finnie said. “And one way is to expand what is allowed through the license.”

Without the Libraries doing the licensing work, “the logistics would have been daunting,” Olivetti said. “Ellen and her team have been able to facilitate interactions with the publishers and move things along. If I had had to do all that on my own, I wouldn’t have started the project.”