Applied data science, machine learning, and artificial intelligence research often require computational use of data. FAIR data are findable, accessible, interoperable, and reusable by machines. Open repositories and open-data aggregators, such as those indexed in re3data.org and FAIRsharing.org, increasingly provide access to datasets that meet the standards laid out in the FAIR principles. However, many FAIR datasets needed for computational research are still not open: access to them may be restricted by a fee or other barriers.
These FAIR but not open datasets can include:
- datasets “closed as necessary” to protect personal privacy, national security, or competitiveness
- well-structured and curated texts and data from proprietary sources, such as:
  - primary research articles available for text mining (e.g., Text and Data Mining at MIT)
  - reference books (e.g., Landolt-Börnstein) and datasets (e.g., NIST SRM)
  - curated databases of data extracted from literature (e.g., Reaxys, GlobalData Power, Google Patents)
  - in-house databases from industry (e.g., pharmaceutical companies, data analytics companies)
Here are some resources for working with these two types of datasets that can be FAIR but not necessarily open.
To use curated data from proprietary sources, you will often need to negotiate terms and conditions for computational access and use, including any additional fees, with the providers of those sources. If you need computational access to and use of an MIT-subscribed resource, please contact email@example.com or your subject specialist. You can learn more about the technical and legal challenges of using data and texts that are not open from this Force11 Scholarly Communication Institute 2020 workshop, and you can evaluate their FAIRness with the assessment tool from the workshop.