AI Companies Are Trying to Get MIT Press Books



As several major publishers sell their authors’ works to tech giants for large language model fodder, MIT Press says it has been approached by AI companies seeking similar deals, and is asking its authors for input before entering into any of them.

On November 7, MIT Press emailed its authors with the subject line “Response Requested: MIT Press author views on LLM training data and licensing.” In it, MIT Press says that it has been approached by “several AI companies and data brokers” about using its publications as training data for generative AI tools “in exchange for payment.” It goes on to say that it has not entered into any such deal “thus far” but recognizes that MIT Press content is “already being used for training purposes.”

The full email is below: 

“Dear MIT Press authors,

Like all other publishers, the MIT Press has been approached by several AI companies and data brokers about using the text from our publications—works that you have authored—as training data for generative AI tools in exchange for payment. We have thus far not entered into any such deals. At the same time, we recognize that MIT Press content is already being used for training purposes while the outstanding legal issues over the use of copyrighted content as training data are being litigated in a multitude of courts.

We understand the value AI companies may see in working directly with publishers to help collect, collate, and properly attribute training data. For our part, while no final decisions have been made, we would like to explore these opportunities for two key reasons: (1) the additional revenue stream is attractive at a time of growing financial challenge and uncertainty in the non-profit academic publishing sector; and (2) we want to ensure that the high-quality work of our authors enhances, and is represented in, these increasingly popular knowledge discovery and creation platforms. Your input will help guide us as we consider these opportunities.

We are aware of other publishers who have entered into monetization arrangements without informing their authors. At the MIT Press, however, we believe that the input from our authors will better inform our decision-making as we explore these opportunities. 

We therefore invite your perspective on whether you believe that content you have authored should be used to train generative AI systems and whether the MIT Press’s practices in this area would influence your choice of the Press as a publishing partner going forward. Please provide your perspectives here. Note, this informal survey is not intended as a mechanism for opting in or out of an LLM licensing arrangement. Rather, it is just an open-ended form for sharing your views, either anonymously or under your own name.

We are hoping to collect as many responses as possible in the next two weeks, ideally before Friday November 22.

Many thanks for your input,
The MIT Press team”

“We have thus far not entered into any licensing deals and don’t know yet if we will,” Amy Brand, director and publisher at MIT Press, told 404 Media. “What we do know is that content that we publish is being used whether we grant permission for this purpose or not. If we did proceed with any licensing deals in future, our authors would have to intentionally opt in of their own accord. We would also have to figure out how to compensate these authors appropriately and to ensure that their works are appropriately attributed in the outputs of any LLM partner.”

The email is soliciting opinions from authors on whether they think their work should be used to train generative AI systems, and links to a Google form where authors can answer five questions about their views on generative AI in publishing. The questions include “Do you believe that works you have authored should be used to train generative AI systems? How do publisher practices in this area impact your own choice of publishing partners?” and “If the MIT Press enters into any such licensing deals that have the potential to include your own work(s), would you expect to be compensated, as per other digital licensing partnerships according to your publishing contract?” 

Screenshot of the questions contained in the MIT Press survey

Brand told 404 Media that the response to the email has been “very strong,” with about 800 survey responses so far out of 5,600 MIT Press authors emailed. “Many of the author responses that we’ve received thus far express a range of concerns about commercial LLM systems—from producing untrustworthy and unverifiable outputs to cannibalizing their own work and the publishing economy—and other creative economies—more generally. On the other hand, there are also several responses from authors who want us to make sure their work is trained on given the fast-growing adoption of these systems,” Brand said.

MIT Press sent out the request for input because “we really do want to know what our authors are thinking and how they would like the MIT Press to proceed in the area of LLM training data,” Brand said. “We plan to analyze the data and base our decisions on what we learn. We may also share a synthesis of the data publicly (fully anonymized, of course) because there is so much interest in the publishing world about how authors are thinking about these complex questions.” 

Aram Sinnreich, who along with co-author Jesse Gilbert published The Secret Life of Data: Navigating Hype and Uncertainty in the Age of Algorithmic Surveillance with MIT Press in April, told 404 Media that he received the email and replied saying he would withhold his consent to use the book for training AI.

“I told them that the problem with using our work to train LLMs isn’t that individual authors deserve to be compensated, or given due credit, for their work being churned through the AI grinder,” Sinnreich said. “Rather, it’s a structural problem in which the labor of working scholars en masse is being used to feed the profits of insatiably greedy tech elites, effecting a massive upwards transfer of wealth while simultaneously undermining the role of expertise and the value of individual perspective in the production of knowledge, which has widespread civic and cultural consequences.” 

Greg Epstein, author of Tech Agnostic: How Technology Became the World’s Most Powerful Religion, and Why It Desperately Needs a Reformation, which was published by MIT Press in October, said he was glad to see the publisher soliciting input. “I’m not a fan of publishers of any kind just giving their rights to be used in generative AI,” he said. “I don’t think that that is likely to be a good thing in the long run for authors, or for readers, or for humanity.”

MIT Press is a non-profit university press, meaning many of its authors are also full-time scholars and academics, and their books go through a rigorous multi-round peer-review process. Presumably, these books would be valuable for use in training LLMs, because they’re often technical and focus on specific topics. 

MIT Press added a line to the front matter of its books in August 2023 specifically prohibiting them from being used to train AI without permission: “No part of this book may be used to train artificial intelligence systems or reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.” 

Penguin Random House, one of the biggest book publishers in the world, introduced a similar line across all imprints globally, confirming to The Bookseller last month that it will appear “in imprint pages across our markets” in all new titles and backlist titles that are reprinted. Penguin Random House’s books will include the line “No part of this book may be used or reproduced in any manner for the purpose of training artificial intelligence technologies or systems.” (MIT Press uses Penguin Random House Publishing Services for the worldwide sales and distribution of its books.)

Other major publishers, however, are selling their authors’ works to tech companies allegedly without telling them. In July, academic publisher Taylor & Francis, which owns Routledge, sold access to its authors’ research as part of a $10 million deal with Microsoft. Authors said they found out their work had been sold through “word of mouth,” trade news outlet The Bookseller reported. A spokesman from the Taylor & Francis group told The Bookseller that the publisher “is providing Microsoft non-exclusive access to advanced learning content and data to help improve relevance and performance of AI systems.” 

Oxford University Press and Wiley both also entered into data deals this year. “We are actively working with companies developing large language models (LLMs) to explore options for both their responsible development and usage,” Oxford University Press told The Bookseller. “This is not only to improve research outcomes, but to champion the vital role that researchers have in an AI-enabled world.”  

HarperCollins, another one of the biggest publishers globally, confirmed yesterday to 404 Media that it has made a deal with an “artificial intelligence technology company” and is asking some of its authors if they want to opt-in to the agreement. 

Last year, a group of authors filed a lawsuit in New York federal court accusing Meta, Microsoft, and Bloomberg of using their work to train artificial intelligence systems without their permission, claiming that their works were used as training data in the “Books3” dataset. The dataset contains around 170,000 books, The Atlantic reported in 2023, and was used to train Meta’s LLaMA, Bloomberg’s BloombergGPT, and EleutherAI’s GPT-J. That lawsuit is ongoing. And in August, a group of authors accused AI company Anthropic—which owns the popular Claude chatbots—of training its models on pirated books. The complaint accuses Anthropic of building “a multibillion-dollar business by stealing hundreds of thousands of copyrighted books.”

“The MIT Press is known to be a strong supporter of open access publishing for scholarly books and journals,” Brand told 404 Media. “But the law is not yet clear on whether openly published works—let alone paywalled works—are fair game for LLM training purposes, unless they are clearly licensed in a way that doesn’t require attribution and doesn’t prevent non-commercial or non-derivative uses. We currently place a notice in our books that says LLM training use is by publisher permission only to make this as clear as possible. Like everyone else in publishing, we await future legal decisions concerning whether LLM training on copyrighted content is deemed a fair use.” 
