Researchers from the UK and Canada have published a paper highlighting a significant problem facing the artificial intelligence (AI) industry: AI-generated content being used to train large language models.

While OpenAI’s GPT, Meta’s LLaMA, and Google’s LaMDA are currently honing their capabilities primarily on human-generated content on the Internet, the web could soon be filled to the brim with the content these models themselves create.

Cambridge University and University of Edinburgh professor of security engineering Ross Anderson, one of the paper’s authors, likened the coming surge of AI-generated content to humanity’s treatment of the environment.

“Just as we’ve strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide, so we’re about to fill the Internet with blah,” wrote Anderson.

The researchers said that using model-generated content in training caused “irreversible defects” in the resulting models.

Looking at probability distributions for text-to-text and image-to-image generation, the paper concluded that training on data produced by other models resulted in “model collapse”.

The term refers to a degenerative process in which models “forget” the true underlying data distribution.
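
A toy simulation can make the idea of “forgetting” a distribution concrete. The sketch below is not the paper’s experiment; it assumes the simplest possible “model”, a single Gaussian, and the function name `simulate_collapse` and its parameters are hypothetical. Each generation is fitted only to a small sample drawn from the previous generation’s fit, and the estimated spread tends to shrink towards zero as the tails of the original distribution are lost.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation for each generation,
    starting with the true value for generation 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # the "true" data distribution (an assumption)
    history = [sigma]
    for _ in range(generations):
        # Training data comes only from the current model, never from
        # the original distribution.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Train" the next model: maximum-likelihood mean and std.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for gen in (0, 10, 50, 100, 200):
        print(f"generation {gen:3d}: std = {history[gen]:.3g}")
```

The small sample size is a deliberate choice to make the effect visible within a few hundred generations; with larger samples the same drift is expected, only more slowly.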

One of the paper’s lead authors, Ilia Shumailov, told VentureBeat they were surprised to observe how quickly model collapse happened.

“Models can rapidly forget most of the original data from which they initially learned,” Shumailov said.

“Over time, mistakes in generated data compound and ultimately force models that learn from generated data to misperceive reality even further.”

Shumailov said the problem was that generative models tended to overfit to popular data and often misunderstood or misrepresented less popular data.

“Original data generated by humans represents the world more fairly, i.e. it contains improbable data too,” Shumailov explained.

An illustrative example would be a machine-learning model continuously trained on a dataset of 100 cat pictures: 90 with yellow fur and 10 with blue.

The model would learn that cats are more likely to be yellow, but it would also start rendering blue cats with a more yellowish colour.

As training continued on such output, the blue fur would erode over time, turning green and eventually yellow.

This loss of minority data and progressive distortion of the remaining data is model collapse.
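
The loss of minority data in the cat example can be sketched with a simple resampling loop. This is an illustration under stated assumptions, not the researchers’ method: the “model” here only estimates the share of blue cats, the function name `resample_cats` is hypothetical, and the sketch captures only the disappearance of a colour, not the gradual blue-to-green shift. Each generation draws 100 new cats from the previous generation’s estimated proportions. Once a colour’s share hits zero it can never return, and because blue starts at 10%, it is usually the colour that vanishes.

```python
import random


def resample_cats(blue=10, total=100, max_generations=5000, seed=0):
    """Return the number of blue cats in each generation.

    Stops once one colour has disappeared entirely, or after
    max_generations as a safety limit.
    """
    rng = random.Random(seed)
    counts = [blue]
    while 0 < blue < total and len(counts) <= max_generations:
        p_blue = blue / total  # what the current model "believes"
        # The next training set is drawn entirely from the current model.
        blue = sum(rng.random() < p_blue for _ in range(total))
        counts.append(blue)
    return counts


if __name__ == "__main__":
    counts = resample_cats()
    survivor = "yellow" if counts[-1] == 0 else "blue"
    print(f"only {survivor} cats remain after {len(counts) - 1} generations")
```

Changing the `seed` argument gives different runs; in a minority of them the blue cats take over instead, which is still a collapse to a single colour.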

Anderson said model collapse would make it harder to train newer models by scraping the web, giving an advantage to firms that had already done so or that controlled access to human interfaces at scale.

“We already see AI startups hammering the Internet Archive for training data,” Anderson said.