Author ORCID Identifier


Document Type


Publication Date



Text data mining, TDM, Fair use, United States, Technology, European Union, Machine learning, AI, Non-expressive use, Expressive use, Copyright


Now that the dust has settled on the Authors Guild cases, this Article takes stock of the legal context for TDM research in the United States. This reappraisal begins in Part I with an assessment of exactly what the Authors Guild cases did and did not establish with respect to the fair use status of text mining. Those cases held unambiguously that reproducing copyrighted works as one step in the process of knowledge discovery through text data mining was transformative, and thus ultimately a fair use of those works. Part I explains why those rulings followed inexorably from copyright's most fundamental principles. It also explains why the precedent set in the Authors Guild cases is likely to remain settled law in the United States.

Parts II and III address legal considerations for would-be text miners and their supporting institutions beyond the core holding of the Authors Guild cases. The Google Books and HathiTrust cases held, in effect, that copying expressive works for non-expressive purposes was justified as fair use. This addresses the most significant issue for the legality of text data mining research in the United States; however, the legality of non-expressive use is far from the only legal issue that researchers and their supporting institutions must confront if they are to realize the full potential of these technologies. Neither case addressed issues arising under contract law, laws prohibiting computer hacking, laws prohibiting the circumvention of technological protection measures (i.e., encryption and other digital locks), or cross-border copyright issues. Furthermore, although Google Books addressed the display of snippets of text as part of the communication of search results, and both Authors Guild cases addressed security issues that might bear upon the fair use claim, those holdings were a product of the particular factual circumstances of those cases and can only be extended cautiously to other contexts. Specifically, Part II surveys the legal status of TDM research in other important jurisdictions and explains some of the key differences between the law in the United States and the law in the European Union. It also explains how researchers can predict which law will apply in different situations.

Part III sets out a four-stage model of the lifecycle of text data mining research and uses this model to identify and explain the relevant legal issues beyond the core holdings of the Authors Guild cases in relation to TDM as a non-expressive use.

First Page


Publication Title

Journal of the Copyright Society of the U.S.A.


(©) Matthew Sag 2019.