AI models fed AI-generated data quickly spew nonsense
Training artificial intelligence (AI) models on AI-generated text quickly leads to the models churning out nonsense, a study has found. This cannibalistic phenomenon, termed model collapse, could halt the improvement of large language models (LLMs) as they run out of human-derived training data and as increasing amounts of AI-generated text pervade the Internet.
“The message is, we have to be very careful about what ends up in our training data,” says co-author Zakhar Shumaylov, an AI researcher at the University of Cambridge, UK. Otherwise, “things will always, provably, go wrong,” he says. The team used a mathematical analysis to show that the problem of model collapse is likely to be universal, affecting all sizes of language model that use uncurated data, as well as simple image generators and other types of AI.
The researchers began by using an LLM to create Wikipedia-like entries, then trained each new iteration of the model on text produced by its predecessor. As the AI-generated information — known as synthetic data — polluted the training set, the model’s outputs became gibberish. The ninth iteration of the model completed a Wikipedia-style article about English church towers with a treatise on the many colours of jackrabbit tails.
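The feedback loop can be reproduced in miniature. The sketch below is illustrative only: it substitutes a toy bigram model for the LLM used in the study, and the corpus and the vocabulary-count metric are invented for the example. Each generation is trained solely on text sampled from the previous generation.

```python
# A minimal sketch of generation-to-generation retraining, with a toy
# bigram model standing in for the study's LLM (illustrative only).
import random
from collections import defaultdict, Counter

def fit_bigram(words):
    """Count word-to-next-word transitions in a token list."""
    model = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        model[a][b] += 1
    return model

def sample(model, start, length=200):
    """Generate tokens by sampling each next word from the fitted counts."""
    out = [start]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:
            break
        words, counts = zip(*successors.items())
        out.append(random.choices(words, weights=counts)[0])
    return out

# Invented seed text, loosely echoing the church-tower example above.
corpus = ("the tower bells rang over the old church while rare swifts "
          "nested in the spire and the choir sang").split()

tokens = corpus
for generation in range(10):
    model = fit_bigram(tokens)         # train on the previous generation's text
    tokens = sample(model, tokens[0])  # replace the data with synthetic output
    print(generation, "distinct words:", len(set(tokens)))
```

Each generation can only reproduce word pairs that happened to survive the previous round of sampling, so rare combinations vanish first and the text degenerates, a small-scale analogue of the collapse the researchers observed.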
More subtly, the study, published in Nature on 24 July, showed that even before complete collapse, learning from AI-derived texts caused models to forget the information mentioned least frequently in their data sets as their outputs became more homogeneous.
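One standard way to see why the rarest information goes first is to repeatedly fit a simple distribution to its own samples. The sketch below uses a Gaussian as a stand-in for the data distribution; the sample sizes and generation counts are illustrative, not taken from the study.

```python
# A minimal sketch of how low-probability ("tail") events disappear under
# repeated fit-and-resample (numbers chosen for illustration only).
import random
import statistics

mu, sigma = 0.0, 1.0
for generation in range(301):
    samples = [random.gauss(mu, sigma) for _ in range(25)]
    mu = statistics.mean(samples)       # refit the model to its own output
    sigma = statistics.stdev(samples)   # finite-sample noise drifts sigma down
    if generation % 50 == 0:
        print(f"gen {generation:3d}: sigma = {sigma:.3f}")
```

Because each refit sees only a finite sample, the estimated spread drifts downward over generations, and events in the tails of the original distribution stop being generated at all. In a language model, those tails correspond to the rarely mentioned information that the study found models forget first.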
This is a concern when it comes to making AI models that represent all groups fairly, because low-probability events often relate to marginalized groups, says study co-author Ilia Shumailov, who worked on the project while at the University of Oxford, UK.
“This is a fantastic paper,” says Julia Kempe, a computer scientist at New York University in New York City. Until now, many technology firms have improved their models by feeding them larger and larger amounts of data. But as human-produced content runs out, they are hoping to use synthetic data to keep improving. The study — a version of which first appeared on the arXiv preprint server in May 2023 — has spurred the AI community to try to find solutions to the problem, she says. “It’s been a call to arms.”