Can AI companies keep stealing books to train their large language models?

Alex Reisner writes:

Should tech companies have free access to copyrighted books and articles for training their AI models? Two judges recently nudged us toward an answer.

More than 40 lawsuits have been filed against AI companies since 2022. The specifics vary, but they generally seek to hold these companies accountable for stealing millions of copyrighted works to develop their technology. (The Atlantic is involved in one such lawsuit, against the AI firm Cohere.) Late last month, there were rulings on two of these cases, first in a lawsuit against Anthropic and, two days later, in one against Meta. Both of the cases were brought by book authors who alleged that AI companies had trained large language models using authors’ work without consent or compensation.

In each case, the judges decided that the tech companies were engaged in “fair use” when they trained their models with authors’ books. Both judges said that the use of these books was “transformative”—that training an LLM resulted in a fundamentally different product that does not directly compete with those books. (Fair use also protects the display of quotations from books for purposes of discussion or criticism.)

At first glance, this seems like a substantial blow against authors and publishers, who worry that chatbots threaten their business, both because of the technology’s ability to summarize their work and its ability to produce competing work that might eat into their market. (When reached for comment, Anthropic and Meta told me they were happy with the rulings.) A number of news outlets portrayed the rulings as a victory for the tech companies. Wired described the two outcomes as “landmark” and “blockbuster.”

But in fact, the judgments are not straightforward. Each is specific to the particular details of each case, and they do not resolve the question of whether AI training is fair use in general. On certain key points, the two judges disagreed with each other—so thoroughly, in fact, that one legal scholar observed that the judges had “totally different conceptual frames for the problem.” It’s worth understanding these rulings, because AI training remains a monumental and unresolved issue—one that could define how the most powerful tech companies are able to operate in the future, and whether writing and publishing remain viable professions. [Continue reading…]
