If you have a research project that involves large-scale computational processing of text, images, or proprietary data, there is a good chance you will need to navigate a complicated legal landscape of copyright and contracts.
Text and data mining is an explicit exception under some countries’ copyright laws; in the US, it falls under the copyright doctrine of fair use. Recent high-profile lawsuits, such as those against OpenAI, Meta, Stability AI, and GitHub, question the boundaries of fair use and highlight the legal challenges researchers face when AI and machine learning projects make use of proprietary sources.
In addition to copyright considerations, many sources of training data are only accessible via contracts that can restrict downstream uses. In the MIT Libraries, we negotiate for computational use rights for library resources, but many texts and databases are still locked behind prohibitive terms as well as cumbersome retrieval mechanisms.
Next week is International Open Access Week, whose theme this year is Community over Commercialization. Commercial vendors that lock information behind paywalls, deny reuse rights, and adopt restrictive interpretations of copyright law can limit how research communities use data and computational research tools, which in turn can reduce our ability to generate new knowledge. Open access content and data are a route to more equitable computational access, use, and innovation across the world.
Learn more about all of this and connect with researchers facing similar issues at this interactive, in-person workshop on October 23 with visiting expert Dave Hansen, executive director of Authors Alliance. Authors Alliance is a nonprofit supporting authors who research and write for the public benefit. Hansen is a copyright expert who has worked extensively on legal barriers to research and is a principal investigator for the Authors Alliance Text and Data Mining: Demonstrating Fair Use Project, which is generously supported by the Mellon Foundation.