MIT Libraries

Text and Data Mining at MIT

Text and data mining (TDM) is a set of research techniques that use computational analysis to extract information from large volumes of text or data. It is an increasingly common research tool with a wide variety of applications, from studying music to predicting materials synthesis.

You will need two things for TDM: tools to do the analysis, and a corpus of material to analyze. This page provides information on corpora available for TDM to the MIT community. If you’re looking for analytical tools for TDM, the DiRT Directory is a good place to start. Many of the APIs listed in our API guide may also be useful.
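To make the two ingredients concrete, here is a minimal sketch of one very simple TDM task: counting term frequencies across a corpus. It uses only the Python standard library. The three-document corpus and the helper names `tokenize` and `term_frequencies` are hypothetical and chosen for illustration only; real projects typically work with far larger corpora and more capable tools.

```python
import re
from collections import Counter

# Hypothetical toy corpus, used for illustration only. A real project would
# load documents from one of the licensed or free corpora instead.
corpus = {
    "doc1": "Text mining extracts patterns from text.",
    "doc2": "Data mining extracts patterns from data.",
    "doc3": "Mining text and data supports research.",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(docs):
    """Count how often each token appears across all documents."""
    counts = Counter()
    for text in docs.values():
        counts.update(tokenize(text))
    return counts


frequencies = term_frequencies(corpus)

# Show the three most frequent terms in the toy corpus.
print(frequencies.most_common(3))
```

Here the dictionary plays the role of the corpus and the two functions play the role of the analytical tool. Swapping in a different corpus does not require changing the analysis code.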

TDM is frequently a fair use under US copyright law, but for many subscribed library resources it is restricted by license agreement. We continue to work with our database vendors to allow TDM on our licensed content. The following list includes licensed corpora available to the MIT community for TDM, as well as selected free corpora that may be of interest. If you’re interested in doing TDM on MIT-licensed content that isn’t listed below, or you know of a free resource you think we should include, please contact us at


Special thanks to Caroline Muglia of USC Libraries, whose text and data mining libguide helped us find many of the free TDM resources in this list.