SOLÈNE REVENEY / LE MONDE
Where does artificial intelligence acquire its knowledge? From an enormous trove of texts used for training. These typically include vast numbers of articles from Wikipedia, but also a wide range of other writings, such as the massive Books3 dataset, which aggregates nearly 200,000 books without the authors’ permission. Some proponents of conversational AI present these training datasets as a form of “universal knowledge” that transcends copyright law, adding that, protected or not, AIs do not memorize these works verbatim and only store fragmented information.
This argument has been challenged by a series of studies, the latest of which, published in early January by researchers at Stanford University and Yale University, is particularly revealing. Ahmed Ahmed and his coauthors managed to prompt four mainstream AI programs, disconnected from the internet to ensure no new information was retrieved, to recite entire pages from books.
‘Harry Potter’ and Marcel Proust
According to the study, Gemini 2.5 Pro was able to reproduce 77% of the text of Harry Potter and the Philosopher’s Stone by JK Rowling, a work protected by copyright. To achieve this, the researchers asked Gemini to complete the book’s opening sentence and then continue, piece by piece.