Text and Data Mining at MIT
Text and data mining (TDM) is a set of research techniques that use computational analysis to extract information from large volumes of text or data. It is an increasingly common research tool with a wide variety of applications, from studying music to predicting materials synthesis.
You will need two things for TDM: tools to do the analysis, and a corpus of material to analyze. This page provides information on corpora available for TDM to the MIT community. If you’re looking for analytical tools for TDM, the DiRT Directory is a good place to start. Many of the APIs listed in our API guide may also be useful.
TDM is frequently a fair use under US copyright law, but for many subscribed library resources it is restricted by license agreement. We continue to work with our database vendors to allow TDM in our licensed content. The following list includes licensed corpora available for TDM by the MIT community, as well as selected free corpora that may be of interest. If you’re interested in doing TDM on MIT-licensed content that isn’t listed below, or you know of a free resource you think we should include, please contact us at textmine@mit.edu.
American Archive of Public Broadcasting (AAPB)
Coverage: Digitized public radio and television programs and their metadata records available from the AAPB Online Reading Room
How it’s accessed: By API, see here
Access restrictions: None
Limitations: Some volume limitations may apply
For more information: https://github.com/WGBH/AAPB2#api
American Association for the Advancement of Science (AAAS)
Coverage: MIT-subscribed and open access content published by AAAS
How it’s accessed: Content may be downloaded for TDM directly from the AAAS online platform for local storage and analysis
Access restrictions: Subscribed content limited to MIT users and walk-in users physically present at MIT
Limitations: Downloading must be limited to a “reasonable rate and speed”; users must comply with the terms in Annex A here
For more information: http://www.sciencemag.org/subscribe/institutional-license-agreement
American Chemical Society (ACS)
Coverage: MIT-subscribed and open access publications by ACS
How it’s accessed: Content is delivered for local storage and analysis; users may use tools of their choice for analysis
Access restrictions: Limited to MIT users and their research collaborators, who must agree to and sign an agreement with ACS; to begin, contact textmine@mit.edu
Limitations: No limitations on volume, but users will need to provide information on the specific content they would like to mine (journal title and date range, or a list of DOIs)
For more information: textmine@mit.edu
American Physical Society (APS)
Coverage: MIT-subscribed and open access journals published by APS
How it’s accessed: TDM access can be arranged by request; MIT users should contact textmine@mit.edu
Access restrictions: Subscribed content limited to MIT users
Limitations: Some rate limits and restrictions may apply
For more information: textmine@mit.edu
arXiv Bulk Data Access
Coverage: e-print service in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, and statistics
How it’s accessed: Multiple access models supported; see: https://arxiv.org/help/bulk_data
Access restrictions: none
Limitations: none through supported access models, but unauthorized automated scraping from URLs is prohibited
Contact for technical questions: https://arxiv.org/help/contact
For more information: https://arxiv.org/help/bulk_data
BioMed Central
Coverage: biology, biomedicine and medicine
How it’s accessed: Multiple access models supported; see: http://old.biomedcentral.com/about/datamining
Access restrictions: none, but registration may be required for some access models
Limitations: none listed
Contact for technical questions: info@biomedcentral.com
For more information: http://old.biomedcentral.com/about/datamining
Brill Academic Publishers
Coverage: MIT-subscribed and open access content published by Brill
How it’s accessed: Content may be downloaded for TDM
Access restrictions: Subscribed content limited to MIT users
Limitations: Some rate limits and restrictions may apply
For more information: textmine@mit.edu
Caselaw Access Project
Coverage: All published US court decisions
How it’s accessed: In-browser API viewer or RESTful interface; some jurisdictions are also available as bulk download
Limitations: Full text of cases limited to 500 cases per person per day, unless otherwise authorized. More on access limits here
For more information: https://case.law/
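Because full text is capped at 500 cases per person per day, a larger corpus has to be collected over several days. The following is a minimal Python sketch of splitting a list of case identifiers into daily batches. The function name and the sample identifiers are hypothetical, and no Caselaw Access Project API call is made.

```python
from typing import Iterator, List, Sequence

DAILY_FULL_TEXT_LIMIT = 500  # per-person daily cap stated in the entry above


def daily_batches(case_ids: Sequence[str],
                  limit: int = DAILY_FULL_TEXT_LIMIT) -> Iterator[List[str]]:
    """Yield consecutive slices of case_ids, each no longer than limit."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(case_ids), limit):
        yield list(case_ids[start:start + limit])


# Hypothetical identifiers, for illustration only.
example_ids = [f"case-{n}" for n in range(1200)]
plan = list(daily_batches(example_ids))
print(len(plan), [len(batch) for batch in plan])  # 3 [500, 500, 200]
```

Each batch can then be requested on a separate day, which keeps a project within the default limit without requesting special authorization.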
Chicago Defender Historical Archive
Coverage: Chicago Defender content 1909-1975
How it’s accessed: Downloadable in XML format from Dataverse
Access restrictions: Limited to MIT users
Limitations: None
Contact for technical questions: dvn_support@help.hmdc.harvard.edu; Questions can also be posted in https://groups.google.com/forum/#!forum/dataverse-community
For more information: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/928QMU
Chronicling America
Coverage: US newspapers from 1789-1924
How it’s accessed: Accessible by API or bulk download
Access restrictions: none, no registration or API key required
Limitations: none
Contact for technical questions: help page
For more information: http://chroniclingamerica.loc.gov/
CORE
Coverage: Open access research papers harvested from member repositories and publications; not limited by subject. More information available here: https://core.ac.uk/dataproviders
How it’s accessed: API or bulk data download; more information available here: https://core.ac.uk/services
How to register: Free to use, but registration and an API key are required; register for an API key at https://core.ac.uk/api-keys/register
Limitations: Quota applied for query volume, details at https://core.ac.uk/services#api
Contact for technical questions: theteam@core.ac.uk
For more information: https://core.ac.uk/services
CrossRef Text and Data Mining Services
Coverage: TDM access depends on publisher participation; not limited by subject; current participating publishers listed here: http://tdmsupport.crossref.org/
How it’s accessed: Uses the CrossRef Metadata API to access the full-text of published content for participating publishers
Limitations: Rate limits may be set by participating publishers
Contact for technical questions: support@crossref.org
For more information: http://tdmsupport.crossref.org/
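As a rough illustration of how full-text locations are found through the metadata API, the Python sketch below filters a work record for links marked for text mining. It assumes the record is the JSON returned by https://api.crossref.org/works/{DOI} and that full-text links appear in a "link" list with "URL", "content-type", and "intended-application" fields. Treat these names as assumptions to check against the current CrossRef documentation; the sample record is hypothetical.

```python
import json
from typing import Any, Dict, List
from urllib.parse import quote
from urllib.request import urlopen

API_ROOT = "https://api.crossref.org/works/"  # assumed endpoint


def text_mining_links(work: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return the full-text links a publisher has flagged for text mining."""
    message = work.get("message", work)  # accept a full response or a bare record
    return [
        {"url": link["URL"],
         "content_type": link.get("content-type", "unspecified")}
        for link in message.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]


def fetch_work(doi: str) -> Dict[str, Any]:
    """Fetch one work record over the network (not called in this sketch)."""
    with urlopen(API_ROOT + quote(doi), timeout=30) as response:
        return json.load(response)


# Hypothetical record, for illustration only.
sample = {"message": {"link": [
    {"URL": "https://example.org/article.xml",
     "content-type": "application/xml",
     "intended-application": "text-mining"},
    {"URL": "https://example.org/article.pdf",
     "content-type": "application/pdf",
     "intended-application": "similarity-checking"},
]}}
print(text_mining_links(sample))
```

Downloading from the returned URLs is still governed by each publisher's license and rate limits, so a record with no text-mining links usually means the publisher does not participate.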
CQ Press
Coverage: American government, politics, history, public policy, and current affairs, 1923-present
How it’s accessed: Content may be downloaded for TDM
Access restrictions: Limited to MIT users, alumni, and walk-in users physically present at MIT
Limitations: Some rate limits and restrictions may apply
For more information: textmine@mit.edu
Dataverse
Coverage: Much data available through Dataverse is available for TDM, including content from the Harvard Dataverse Network, MIT Libraries-purchased data, and data deposited in other Dataverse Network repositories
How it’s accessed: Data may be downloaded for local analysis, or users may use a Dataverse API
Access restrictions: Access to restricted data sets requires approval by data owners. To access MIT Libraries-purchased data, log in to Dataverse by selecting Massachusetts Institute of Technology and using your certificates or Touchstone. More information available at: http://guides.dataverse.org/en/4.6/api/dataaccess.html#authentication-and-authorization
Limitations: No limitations on public data set downloads after agreeing to terms of use; no limitations on restricted data set downloads after access is granted by data owners
Contact for technical questions: dvn_support@help.hmdc.harvard.edu; questions can also be posted in https://groups.google.com/forum/#!forum/dataverse-community
For more information: http://guides.dataverse.org/en/4.6/user/index.html
Digital Public Library of America (DPLA) metadata
Coverage: metadata of content indexed through DPLA
How it’s accessed: DPLA metadata is accessible by API or as zipped JSON files for bulk download
How to register: Free to use; API key must be requested with information here: https://dp.la/info/developers/codex/policies/#get-a-key
Limitations: None
Contact for technical questions: codex@dp.la; Users can also submit issues to DPLA’s Issue Tracker
For more information: http://dp.la/info/developers/
Digital Theatre
Coverage: MIT-subscribed and open access content published by Digital Theatre
How it’s accessed: Contact textmine@mit.edu
Access restrictions: Subscribed content limited to MIT users
Limitations: Non-commercial use only; some rate limits and restrictions may apply
For more information: textmine@mit.edu
Early English Books Online Text Creation Partnership (EEBO-TCP)
Coverage: the EEBO TCP Phase I corpus: books printed in England, Ireland, Scotland, Wales and British North America and works in English printed elsewhere from 1473–1700
How it’s accessed: full-text access and search tools are available to all via the University of Michigan EEBO-TCP site; downloadable full-text files are available here; HTML, ePUB, and TEI P5 XML copies are available through the Oxford Text Archive; and TIFF files of page images are available to MIT users here
Access restrictions: text versions of Phase I content are openly available for public use; page images and Phase II text are limited to subscribing institutions
Limitations: no limitations on openly available data, access via ProQuest subject to terms of use
Contact for technical questions: University of Michigan EEBO help
For more information: http://quod.lib.umich.edu/e/eebogroup/
Eighteenth Century Collections Online Text Creation Partnership (ECCO-TCP)
Coverage: English-language and foreign-language titles printed in the United Kingdom during the 18th century, along with thousands of important works from the Americas (the ECCO-TCP corpus)
How it’s accessed: Multiple ways to access, listed here
Access restrictions: none
Limitations: none
Contact for technical questions: http://www.textcreationpartnership.org/contact/
For more information: http://www.textcreationpartnership.org/tcp-ecco/
Electrochemical Society
Coverage: MIT-subscribed Electrochemical Society publications
How it’s accessed: TDM access can be arranged on a per-project basis; MIT users can contact textmine@mit.edu to inquire
Access restrictions: Subscribed content limited to MIT users
Limitations: Some rate limits and restrictions may apply
For more information: textmine@mit.edu
Elsevier
Coverage: MIT-subscribed and open access content available on the ScienceDirect platform
How it’s accessed: via Elsevier APIs
Access restrictions: Subscribed content limited to MIT users; MIT users can contact textmine@mit.edu to receive an API key
Limitations: Non-commercial use only; some restrictions on use of TDM corpus and output
Contact for technical questions: integrationsupport@elsevier.com
For more information: https://dev.elsevier.com/text_mining.html; https://www.elsevier.com/about/our-business/policies/text-and-data-mining
Emerald Publishing
Coverage: MIT-subscribed and open access content available from Emerald Publishing
How it’s accessed: Accessible via CrossRef’s TDM service; to avoid an IP block, users should tell support@emeraldinsight.com which IP address will be used for mining before they begin
Access restrictions: Subscribed content limited to MIT users
Limitations: Non-commercial use only; subject to terms in the TDM license
Contact for technical questions: support@emeraldinsight.com
For more information: http://www.emeraldinsight.com/page/tdm; http://www.emeraldinsight.com/page/tdmfaqs
Europeana APIs and Data Collections
Coverage: Wide variety of European content, selected openly available content listed here
How it’s accessed: Searchable by web interface; four APIs are also available for accessing metadata and annotations and for downloading Europeana data
Access restrictions: none
Limitations: Some content restricted by copyright and use is subject to terms; search results from web interface can be filtered by reuse rights, and data in featured datasets is freely available; API use subject to API Terms of Use
Contact for technical questions: API Google groups page
For more information: http://labs.europeana.eu/api; http://labs.europeana.eu/data
Evans Early American Imprint Collection Text Creation Partnership (Evans-TCP)
Coverage: 6,000 of the most frequently studied books from the Evans Early American Imprints Collection
How it’s accessed: Evans-TCP web interface
Access restrictions: none
Limitations: none
Contact for technical questions: http://www.textcreationpartnership.org/contact/
For more information: http://quod.lib.umich.edu/e/evans/
Google Books
Coverage: Large corpus of > 25 million scanned books from libraries and publishers, including foreign language corpora
How it’s accessed: Multiple ways to access, including third party tools: search via Google Books web interface, Ngram Viewer, BYU Google Books viewer, Culturomics Bookworm Viewer
Access restrictions: none
Limitations: TDM output limited to snippet view for in-copyright works
Contact for technical questions: Google Books help
For more information: https://books.google.com/intl/en/googlebooks/about/
HathiTrust Datasets
Coverage: Two large corpora of scanned works available for download: a non-Google corpus of >550,000 primarily English-language public domain volumes published prior to 1923, and a Google-digitized corpus of >4.8 million public domain volumes in a wide variety of languages, subjects, and dates (see visualizations of coverage); custom datasets also available
How it’s accessed: All content available for search via web interface, some content also available for download or via API
Access restrictions: All text is searchable, but output results are limited for some works; no restrictions on download of the non-Google public domain corpus; download of the Google-digitized corpus is restricted to participating institutions
Limitations: Web search output results are limited for in-copyright works; information on limitations for downloaded corpora available here
Contact for technical questions: feedback@issues.hathitrust.org
For more information: https://www.hathitrust.org/datasets
HeinOnline
Coverage: Full-text legal history collection from 1700-present including legal journals, books, world constitutions, treaties, US Supreme Court reports, US Code, Statutes at Large, Code of Federal Regulations, Congressional Record, presidential papers, Foreign Relations of the United States, federal agency reports and records, Philippine law collection, resources for researching legislative histories, and 5th-7th editions of Leiter’s “National Survey of State Laws”
How it’s accessed: TDM access can be arranged on a per-project basis; MIT users can contact textmine@mit.edu to inquire
Access restrictions: Subscribed content limited to MIT users
Limitations: Some rate limits and restrictions may apply
For more information: textmine@mit.edu
Institute of Physics (IOP)
Coverage: MIT-subscribed and open access journals published by IOP
How it’s accessed: TDM access can be arranged by request; MIT users should contact textmine@mit.edu
Access restrictions: Subscribed content limited to MIT users
Limitations: Some rate limits and restrictions may apply
For more information: textmine@mit.edu
Internet Archive eBooks and Texts
Coverage: Over 11 million fully accessible books and texts
How it’s accessed: Searchable by web interface, with multiple download formats for individual works; instructions for a method for bulk download here
Access restrictions: none
Limitations: No stated technical limitations; subject to terms of use
Contact for technical questions: info@archive.org
For more information: https://archive.org/details/texts
JSTOR Data for Research
Coverage: JSTOR’s scholarly journal and primary resource collections
How it’s accessed: web interface: http://dfr.jstor.org/
Access restrictions: Free to access, registration is required to obtain results; no institutional affiliation is required
Limitations: Datasets are capped by default at 1,000 articles; users seeking larger results are asked to contact JSTOR Data for Research
Contact for technical questions: http://about.jstor.org/contact
For more information: http://about.jstor.org/service/data-for-research
LexisNexis Academic
Coverage: news, business, and legal resources
How it’s accessed: Content may be downloaded for TDM via the web interface
Access restrictions: Access limited to MIT users
Limitations: Content may be downloaded from search results in batches of up to 500 records, and must be deleted after 90 days; scripting of batch downloads is not permitted
For more information: textmine@mit.edu
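The 90-day deletion requirement is easier to meet if each manually downloaded batch is logged with its download date. The Python sketch below computes deletion deadlines from such a log. It does not download anything, the file names and dates are hypothetical, and treating day 90 itself as the deadline is a conservative assumption rather than the license wording.

```python
from datetime import date, timedelta
from typing import Dict, List

RETENTION_DAYS = 90  # retention period stated in the entry above


def deletion_deadline(downloaded_on: date,
                      retention_days: int = RETENTION_DAYS) -> date:
    """Return the date by which a batch should be deleted."""
    return downloaded_on + timedelta(days=retention_days)


def overdue(downloads: Dict[str, date], today: date) -> List[str]:
    """Return the names of batches whose deadline has arrived, sorted."""
    return sorted(name for name, downloaded_on in downloads.items()
                  if today >= deletion_deadline(downloaded_on))


# Hypothetical download log, for illustration only.
log = {"batch_001.txt": date(2024, 1, 1), "batch_002.txt": date(2024, 3, 1)}
print(deletion_deadline(log["batch_001.txt"]))  # 2024-03-31
print(overdue(log, today=date(2024, 4, 15)))    # ['batch_001.txt']
```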
Library of Congress
Coverage: Library of Congress Digital Collections and bibliographic metadata
How it’s accessed: A JSON API provides programmatic access to Library of Congress Digital Collections; bibliographic data sets can be downloaded in UTF8, MARC8 and XML formats; and several additional APIs are available to search and manipulate Library of Congress online content
Access restrictions: None
Limitations: Some batch limits may apply depending on specific tool used
For more information: https://labs.loc.gov/lc-for-robots/
Mendeley
Coverage: reference manager and academic social network
How it’s accessed: Content on Mendeley may be extracted using spiders, bots, or other systems
Access restrictions: MIT users of Mendeley Institutional Edition
Limitations: Noncommercial use only, and use must not interfere with the functioning of the Mendeley site
For more information: textmine@mit.edu
music21
Coverage: Corpus of encoded public domain and openly licensed musical compositions; full list available here: http://mit.edu/music21/doc/about/referenceCorpus.html
How it’s accessed: Install the music21 toolkit and access via Python
Access restrictions: None
Limitations: None
Contact for technical questions: http://groups.google.com/group/music21list
For more information: http://mit.edu/music21/
National Library of Medicine (NLM)
Coverage: Corpora and tools for accessing various NLM databases and biomedical literature
For more information: https://wwwcf.nlm.nih.gov/nlm_eresources/eresources/search_database.cfm; https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#
Notable included corpora:
Digital Collections Web Service
Coverage: metadata and full-text OCR of all resources in the Digital Collections repository
How it’s accessed: HTTP requests using structured URL queries
Access restrictions: none
Limitations: 85 requests per minute per IP address; contact NLM for larger projects
Contact for technical questions: https://support.nlm.nih.gov/ics/support/ticketnewwizard.asp?style=classic&deptID=28054
For more information: https://collections.nlm.nih.gov/web_service.html
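A limit of 85 requests per minute can be respected on the client side with a sliding-window limiter. The Python sketch below is one possible approach, not NLM's own tooling. The class name is hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls calls in any window of period seconds."""

    def __init__(self, max_calls: int = 85, period: float = 60.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._stamps: Deque[float] = deque()  # times of recent calls

    def acquire(self) -> float:
        """Block until a call is allowed; return the seconds spent waiting."""
        waited = 0.0
        while True:
            now = self._clock()
            # Forget calls that have left the window.
            while self._stamps and now - self._stamps[0] >= self.period:
                self._stamps.popleft()
            if len(self._stamps) < self.max_calls:
                break
            # Wait until the oldest remembered call leaves the window.
            pause = self._stamps[0] + self.period - now
            self._sleep(pause)
            waited += pause
        self._stamps.append(now)
        return waited
```

Calling acquire() immediately before each HTTP request keeps a single process under the limit. Because the limit is per IP address, several processes sharing one address would need to share one limiter or divide the allowance between them.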
PMC Open Access Web Service
Coverage: Allows discovery of downloadable fulltext resources from the PMC Open Access Subset
How it’s accessed: Available for download via FTP or API
Result format: XML results showing articles available in tgz or PDF format
Limitations: Result set limited to 1000 records at a time
Contact for technical questions: pubmedcentral@ncbi.nlm.nih.gov
For more information: https://www.ncbi.nlm.nih.gov/pmc/tools/oa-service/
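Because each response is capped at 1,000 records, a full listing has to be assembled page by page. The Python sketch below parses one response and extracts the address of the next page. It assumes the XML uses "record" elements containing "link" elements with "format" and "href" attributes, plus a "resumption" element pointing to the next page. Check these element names against the service documentation; the sample response is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

FileEntry = Tuple[str, str, str]  # (record id, format, download address)


def parse_oa_page(xml_text: str) -> Tuple[List[FileEntry], Optional[str]]:
    """Return the downloadable files on one page and the next-page URL, if any."""
    root = ET.fromstring(xml_text)
    files: List[FileEntry] = []
    for record in root.iter("record"):
        for link in record.findall("link"):
            files.append((record.get("id", ""),
                          link.get("format", ""),
                          link.get("href", "")))
    next_link = root.find(".//resumption/link")
    next_url = next_link.get("href") if next_link is not None else None
    return files, next_url


# Hypothetical response, for illustration only.
SAMPLE = """<OA><records returned-count="1" total-count="2">
<record id="PMC0000001"><link format="tgz" href="ftp://example.org/a.tar.gz"/></record>
<resumption><link token="abc" href="https://example.org/next?resumptionToken=abc"/></resumption>
</records></OA>"""

files, next_url = parse_oa_page(SAMPLE)
print(files)
print(next_url)
```

A harvesting loop would fetch next_url, parse it the same way, and stop when the function returns None for the next page.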
Nature
Coverage: MIT-subscribed and open access publications by Nature
How it’s accessed: Content may be downloaded directly from the Nature online platform, including by automated download
Access restrictions: Limited to MIT users and research collaborators
Limitations: Download of content should not exceed 1 document per second; TDM rights are limited to non-commercial use
For more information: textmine@mit.edu
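For automated downloads, the limit of 1 document per second can be enforced by spacing requests at least one second apart. The Python sketch below shows one way to do that. The class name is hypothetical, and the injectable clock and sleep functions exist only so the spacing can be verified without real delays.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Ensure successive calls to wait() return at least min_interval apart."""

    def __init__(self, min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous download

    def wait(self) -> float:
        """Pause if the previous call was too recent; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
                now = self._clock()
        self._last = now
        return delay
```

Calling wait() before each download caps throughput at one document per second. Time spent on the download itself counts toward the interval, so slow responses add no extra delay.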
New York Times
Coverage: metadata and some content from New York Times articles 1851-present
How it’s accessed: Multiple APIs are available for different uses, full list here
Access restrictions: Free to access with registration and acceptance of terms of use
Limitations: Noncommercial use only, and users must agree to terms of use; API calls limited to 1,000 per day and 5 per second
Contact for technical questions: code@nytimes.com
For more information: http://developer.nytimes.com/
Oxford Text Archive
Coverage: full catalog and selected corpora here
How it’s accessed: Searchable by web interface, with multiple download formats for individual works; curated corpora also available
Access restrictions: Most texts are free to use; some subject to depositor restrictions
Limitations: User must agree to user agreement
Contact for technical questions: ota@it.ox.ac.uk
For more information: http://ota.ox.ac.uk/
PLOS
Coverage: Every PLOS article, including all Articles and Front Matter. Updated daily. Does not include Figures or Supplemental Data
How it’s accessed: Bulk download here
Access restrictions: none
Limitations: No stated technical limitations; content under CC-BY license
For more information: https://www.plos.org/text-and-data-mining
Project Gutenberg
Coverage: >53,000 books, primarily in the public domain (pre-1923)
How it’s accessed: Individual works are downloadable in multiple formats from the Project Gutenberg website; bulk downloading is permitted via mirroring or wget; more information available here
Access restrictions: none
Limitations: Most works are available for use without restriction, but in-copyright works may have individual restrictions; users must agree to terms of use
Contact for technical questions: Contact information
For more information: http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages
ProQuest
Coverage: Some content from ProQuest, including historical newspapers, is available for TDM; MIT users can contact textmine@mit.edu to inquire about specific content
How it’s accessed: Content is delivered for local storage and analysis; TDM is prohibited through the online interface
Access restrictions: Limited to MIT users
Limitations: Content must be purchased, delivered, and locally stored by MIT; process may take six months or longer
For more information: textmine@mit.edu
Royal Society of Chemistry
Coverage: MIT-subscribed and open access content published by the Royal Society of Chemistry
How it’s accessed: TDM access can be arranged by request; MIT users should contact textmine@mit.edu
Access restrictions: MIT subscribed content limited to MIT users
Limitations: Noncommercial use only
For more information: textmine@mit.edu
Sage Publications
Coverage: MIT-subscribed and open access content published by Sage
How it’s accessed: Content may be downloaded for TDM
Access restrictions: MIT subscribed content limited to MIT users
Limitations: Noncommercial use only
For more information: textmine@mit.edu
SAO/NASA Astrophysics Data System (ADS)
Coverage: Three bibliographic databases of publications in astronomy, astrophysics, physics, and all content included in the arXiv e-prints
How it’s accessed: Multiple interfaces provided: Bumblebee, classic, and browsable interfaces, as well as an API
Access restrictions: none
Limitations: systematic downloading of content prohibited except through provided API, API subject to rate and scope limits; researchers seeking to regularly download and store copies of API results should contact ADS first
Contact for technical questions: adshelp@cfa.harvard.edu
For more information: http://adswww.harvard.edu/
Springer
Coverage: MIT-subscribed and open access content on the SpringerLink platform
How it’s accessed: Content may be downloaded for TDM directly from SpringerLink, and downloading may be automated for that purpose; Springer APIs may be used to identify desired content for download
Access restrictions: Must be part of a subscribing institution
Limitations: Non-commercial use only; users should adhere to the Springer TDM policy
Contact for technical questions: mikail.shaikh@springer.com, support.api@springer.com
For more information: https://www.springer.com/gp/rights-permissions/springer-s-text-and-data-mining-policy/29056
Taylor & Francis
Coverage: MIT-subscribed and open access journals published by Taylor & Francis
How it’s accessed: TDM access can be arranged by request; MIT users should contact textmine@mit.edu
Access restrictions: Subscribed content limited to MIT users
Limitations: Some rate limits and restrictions may apply
For more information: textmine@mit.edu
Wall Street Journal Historical Archive
Coverage: Wall Street Journal content 1889-1934
How it’s accessed: Downloadable in XML format from Dataverse
Access restrictions: Limited to MIT users
Limitations: None
Contact for technical questions: dvn_support@help.hmdc.harvard.edu; Questions can also be posted in https://groups.google.com/forum/#!forum/dataverse-community
For more information: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XAUHMH
Wiley
Coverage: MIT-subscribed and open access content in the Wiley Online Library
How it’s accessed: Accessible via CrossRef’s TDM service; RESTful interface
Access restrictions: Must be part of a subscribing institution, and agree to a click-through Agreement
Limitations: Rate limits are implemented through CrossRef rate-limiting headers; exact limits not specified
Contact for technical questions: TDM@wiley.com; labs@crossref.org for support using the CrossRef TDM service
For more information: http://olabout.wiley.com/WileyCDA/Section/id-826542.html
World Digital Library
Coverage: Primary source materials from many cultures and countries, representing over 100 different languages
How it’s accessed: Multiple access methods supported, see: http://api.wdl.org/
Access restrictions: None
Limitations: None
For more information: http://api.wdl.org/; https://labs.loc.gov/lc-for-robots/
Special thanks to Caroline Muglia of USC Libraries, whose text and data mining libguide helped us find many of the free TDM resources in this list.