There’s a new trend spreading rapidly across the digital world: generative artificial intelligence (AI), represented by platforms such as ChatGPT for text and Stable Diffusion for images, is now available to a wide audience. While these technologies make our digital experience more exciting and diverse, they also carry risks that a team of researchers has highlighted.
We’re living in an era in which AI-generated content is steadily populating the internet. At the same time, AI companies comb the internet for freely available data to train their language and image models. However, as a study by Cornell University emphasizes, a tangible danger arises when the data used to train these models were produced by the models themselves.
Let’s take a closer look at “model collapse,” a phenomenon that occurs when AI models are trained on their own outputs in a repeating loop. What exactly happens? In the first round, the models lose a portion of the actual information about the world. With every subsequent generation, they mix the remaining real-world information with content they have generated themselves. The outcome is a steadily increasing distortion of reality. A text-based AI trained in this manner could end up sounding less and less human, an unwanted outcome that runs counter to the original intent.
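To make the loop concrete, here is a minimal toy sketch in Python. It is not the method used in the study: the "model" is assumed to be nothing more than a bell curve fitted to a handful of numbers, and the function name `simulate_collapse` and its parameters are made up for illustration. Each generation fits the curve to the previous generation's output and then produces the next batch of data from that fit alone.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to data sampled from the previous fit.

    Returns the fitted standard deviation of every generation.
    Toy assumption: the "real world" is a standard normal distribution.
    """
    rng = random.Random(seed)
    # Generation 0 is trained on real data.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = []
    for _ in range(generations):
        # "Train" the model: estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        spreads.append(sigma)
        # The next generation sees only what this model generates.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return spreads


if __name__ == "__main__":
    spreads = simulate_collapse()
    print("spread in first generation:", spreads[0])
    print("spread in last generation: ", spreads[-1])
```

In this toy setup the fitted spread tends to shrink from one generation to the next, because every fit is made from a small sample and the estimation errors compound. The rare values in the tails of the original distribution are the first to disappear, which loosely mirrors the loss of real-world information described above.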
So what’s the solution? Fresh, human-generated data is the key. That, however, is easier said than done, as it’s often unclear whether data found on the internet were produced by humans or by machines. The researchers underscore the need to preserve access to the original data that models were trained on and to regularly refresh the data pool with new, non-AI-generated data.
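Why refreshing the data pool helps can be illustrated with another small Python sketch. Again, this is a toy under stated assumptions rather than the researchers' procedure: the "model" is a bell curve fitted to numbers, "human data" is assumed to come from a standard normal distribution, and `simulate_with_refresh` and `fresh_fraction` are hypothetical names. In every generation, a fixed share of the training data is drawn fresh from the real distribution, and the rest comes from the previous model.

```python
import random
import statistics


def simulate_with_refresh(generations, sample_size, fresh_fraction, seed=0):
    """Fit a Gaussian each generation on a mix of fresh and generated data.

    fresh_fraction is the share of each training set drawn from the
    real distribution (assumed here to be a standard normal).
    Returns the fitted standard deviation of every generation.
    """
    rng = random.Random(seed)
    n_fresh = round(sample_size * fresh_fraction)
    n_synthetic = sample_size - n_fresh
    # Generation 0 is trained on real data only.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = []
    for _ in range(generations):
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        spreads.append(sigma)
        # New human-made data, independent of any model.
        fresh = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]
        # Data generated by the model that was just fitted.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_synthetic)]
        data = fresh + synthetic
    return spreads


if __name__ == "__main__":
    refreshed = simulate_with_refresh(300, 100, fresh_fraction=0.5)
    closed_loop = simulate_with_refresh(300, 10, fresh_fraction=0.0)
    print("final spread with 50% fresh data:", refreshed[-1])
    print("final spread with no fresh data: ", closed_loop[-1])
```

With half of each training set drawn fresh, the fitted spread stays anchored near that of the real distribution, whereas the closed loop with no fresh data drifts toward zero. The catch noted above still applies: this only works if the fresh data can be reliably identified as human-made.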
This calls for a coordinated effort from AI communities and companies to make clear which data are of human origin and which were produced by AI models. Unless such measures are taken promptly, developing new AI models trained on genuine human data could become increasingly difficult. Our digital future hinges on how well we master this challenge.