As AI grows in popularity, researchers caution that the industry could run short of training data, the resource that powers advanced AI systems. Such a shortage could slow the development of AI models, particularly large language models, and might even alter the course of the AI revolution.
Given the vast amount of data available on the web, a shortage may seem unlikely, yet the supply of data suitable for training is limited. Addressing this risk matters for the continued advancement of AI technology.
Training accurate, high-quality AI algorithms requires a substantial amount of data. For example, ChatGPT was trained on 570 gigabytes of text data, or approximately 300 billion words.
Stable Diffusion, the model behind AI image-generating apps such as Lensa, was trained on data drawn from the LAION-5B dataset, which contains 5.8 billion image-text pairs. Too little training data leads to inaccurate or low-quality outputs. The quality of the training data matters as well. Low-quality data such as social media posts or blurry photos is easy to find, but it is not good enough for training high-performance AI models.
Text from social media platforms may also contain bias, prejudice, disinformation, or illegal content, which an AI model can reproduce. For instance, when Microsoft tried to train its AI bot on Twitter content, the bot learned to produce racist and misogynistic outputs.
Therefore, AI developers prefer high-quality content from books, online articles, Wikipedia, scientific papers, and filtered web content.
The AI industry has relied on ever-larger datasets to train high-performing models like ChatGPT and DALL-E 3. However, researchers predict that high-quality text data will run short by 2026, low-quality language data sometime between 2030 and 2050, and low-quality image data between 2030 and 2060. This scarcity may slow AI’s development despite its potential contribution of $15.7 trillion to the global economy by 2030.
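To make the logic behind such forecasts concrete, here is a minimal Python sketch of the general idea: if the data needed for training grows faster than the stock of available data, the two curves eventually cross. This is not the researchers' actual method. The function name, its parameters, and every number in the usage lines are hypothetical and chosen only for illustration.

```python
def exhaustion_year(start_year, stock, demand, stock_growth, demand_growth, horizon=100):
    """Return the first year in which demand for data reaches the available stock.

    All inputs are hypothetical, in arbitrary units:
      stock         - data available in start_year
      demand        - data needed for training in start_year
      stock_growth  - yearly fractional growth of the stock (0.0 means no growth)
      demand_growth - yearly fractional growth of demand (1.0 means doubling)
    Returns None if demand never catches up within `horizon` years.
    """
    for offset in range(horizon):
        if demand >= stock:
            return start_year + offset
        # Compound both quantities forward by one year.
        stock *= 1 + stock_growth
        demand *= 1 + demand_growth
    return None


# Illustrative values only: a fixed stock of 8 units and demand that doubles yearly.
crossing = exhaustion_year(2022, stock=8, demand=1, stock_growth=0.0, demand_growth=1.0)
print(crossing)
```

In this toy setup, lowering the starting demand (for example, by training more efficiently) or raising the stock growth rate pushes the crossing year later, which is why the predicted dates span such wide ranges.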
AI enthusiasts may worry about what comes next, but there is reason for hope. Although uncertainties persist, algorithms that use data more efficiently could produce high-performing AI with less data and a lower environmental impact.