Last year, when academic Jathan Sadowski was looking for an analogy to describe how AI systems deteriorate, he hit upon the term “Habsburg AI”.
The Habsburgs were one of Europe’s most powerful royal families, but centuries of inbreeding led to the demise of entire branches of their line.
Recent research has shown that the AI programs behind products such as ChatGPT suffer a similar collapse when repeatedly fed their own data.
“I think the term Habsburg AI has aged very well,” Sadowski told AFP, adding that his coinage has “only become more relevant for how we think about AI systems”.
The ultimate danger is that AI-generated content takes over the web, rendering chatbots and image generators useless and causing a trillion-dollar industry to crash.
Other experts, however, believe that the issue is exaggerated or that it is fixable.
Yet many businesses are keen to use synthetic data to train AI models. This artificially generated data is used to supplement or replace real-world information, and it is cheaper than human-created content as well as more consistent.
“The open question for researchers and companies building AI systems is: how much synthetic data is too much,” said Sadowski, an emerging technologies lecturer at Monash University in Australia.
Training AI programs such as large language models (LLMs) involves scraping vast quantities of text or images from the internet.

This data is broken down into trillions of small machine-readable units called tokens.

When asked a question, a program like ChatGPT selects and assembles tokens in the sequence that its training data suggests is most likely to match the query.
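In spirit, that selection step amounts to picking the statistically most likely continuation. The toy sketch below illustrates the idea with a simple bigram count model; real LLMs use neural networks trained on trillions of tokens, so this is an illustration of the principle only, not how ChatGPT actually works.

```python
# Toy sketch of next-token selection using a bigram count model.
# Real LLMs use neural networks trained on trillions of tokens; this only
# illustrates the idea of "pick the continuation the training data makes
# most likely", not how ChatGPT actually works.
from collections import Counter, defaultdict

training_text = "the cat sat on the mat the cat ate the fish".split()

# Count how often each token follows each other token in the training data.
bigrams = defaultdict(Counter)
for prev_tok, next_tok in zip(training_text, training_text[1:]):
    bigrams[prev_tok][next_tok] += 1

def generate(prompt_token, length=5):
    """Greedily append the most likely next token at each step."""
    out = [prompt_token]
    for _ in range(length):
        followers = bigrams.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

print(generate("the"))  # continues the prompt with the most likely tokens
```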
However, even the best AI tools produce falsehoods and nonsense, and critics have long expressed concern about what would happen if a model were fed its own outputs.
In late July, a paper in the journal Nature titled “AI models collapse when trained on recursively generated data” sparked heated debate.
The researchers found that models quickly discarded the rarer elements of their original dataset and ended up producing “gibberish” outputs, Nature reported.

A week later, researchers from Rice and Stanford universities published a paper titled “Self-consuming generative models go MAD” that reached a similar conclusion.

They tested image-generating AI programs and found that as AI-generated data was fed back into the underlying model, the outputs grew increasingly generic and drifted away from the desirable elements of the original data.

They dubbed this form of model collapse “Model Autophagy Disorder” (MAD), likening it to mad cow disease, a fatal illness spread by feeding the remains of dead cattle to other cattle.
These academics worry that AI-generated text, photos, and videos are crowding out the usable human-created data on the web.
“One doomsday scenario is that if left uncontrolled for many generations, MAD could poison the data quality and diversity of the entire internet,” Richard Baraniuk, one of the Rice University authors, said in a statement.
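The dynamic both papers describe can be illustrated with a toy simulation: estimate a distribution from data, generate the next dataset from that estimate, and repeat. Rare items drop out of the vocabulary within a few generations. This is a minimal sketch of the idea under simplified assumptions, not a reproduction of either paper's experiments.

```python
# Toy simulation of model collapse: estimate token frequencies from a dataset,
# generate the next dataset purely from that estimate, and repeat. Rare tokens
# vanish from the vocabulary within a few generations. This is an illustration
# of the idea only, not the setup used in the Nature or Rice/Stanford papers.
import random
from collections import Counter

random.seed(42)

# Generation 0: "human" data with a few common words and many rare ones.
vocab = ["the", "cat", "sat"] + [f"rare_word_{i}" for i in range(50)]
weights = [100, 60, 40] + [1] * 50
data = random.choices(vocab, weights=weights, k=500)

for generation in range(8):
    counts = Counter(data)
    print(f"generation {generation}: distinct tokens = {len(counts)}")
    # "Train" the next model on the previous model's own output: sample a new
    # dataset from the observed frequencies. Tokens that were already rare
    # tend to disappear entirely, and once gone they never come back.
    tokens, freqs = zip(*counts.items())
    data = random.choices(tokens, weights=freqs, k=500)
```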
However, industry figures are unconcerned.
Anthropic and Hugging Face, two industry leaders that pride themselves on an ethical approach to the technology, told AFP that they use AI-generated data to fine-tune or filter their datasets.
Anton Lozhkov, a machine learning engineer at Hugging Face, said the Nature research provided an intriguing theoretical perspective, but its doomsday scenario was unrealistic.
“Training on multiple rounds of synthetic data is simply not done in reality,” he told AFP.
However, he said that researchers, like everyone else, were frustrated with the state of the internet.
“A large part of the internet is trash,” he said, adding that Hugging Face already makes major efforts to clean data, sometimes removing as much as 90 percent of it.
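Such cleaning typically relies on heuristic filters. The sketch below shows the kind of rules involved; the specific checks and thresholds are illustrative assumptions, not Hugging Face's actual pipeline.

```python
# Hypothetical sketch of heuristic web-text filtering of the kind used to clean
# training data. The rules and thresholds below are illustrative assumptions,
# not Hugging Face's actual pipeline.
def looks_usable(doc: str) -> bool:
    words = doc.split()
    if len(words) < 20:                        # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:     # highly repetitive boilerplate
        return False
    letters = sum(ch.isalpha() for ch in doc)
    if letters / max(len(doc), 1) < 0.6:       # mostly symbols, markup or numbers
        return False
    return True

corpus = [
    "Buy now buy now buy now buy now buy now buy now buy now buy now buy now buy now",
    "The Habsburgs were one of Europe's most powerful royal families, and their "
    "history is often invoked as a warning about what happens when a closed system "
    "keeps recycling the same material over and over again.",
]
cleaned = [doc for doc in corpus if looks_usable(doc)]
print(f"kept {len(cleaned)} of {len(corpus)} documents")
```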
He hoped that web users would help clean up the internet by simply not engaging with generated content.

“I strongly believe that humans will see the effects and catch generated data way before models will,” he said.
