"2024 Artificial Intelligence Index Report" - 1.1 Model Data Exhaustion Risk
AI

"2024 Artificial Intelligence Index Report" - 1.1 Model Data Exhaustion Risk

April 21, 2024

This chapter differs somewhat from my previous understanding. In an earlier post, I discussed strategies for using lower-version GPT models to train higher-version models.

In the report's section "Will models run out of data?", the concern is raised that language data, whether low-quality or high-quality, and even image data, will eventually be insufficient to support the training of ever-larger models.

To address this challenge, many researchers train one large language model (LLM) on the outputs of another, supplementing real data with synthetic data. However, studies indicate that this approach has significant flaws: models gradually lose their memory of the true underlying data distribution and begin producing outputs drawn from an increasingly narrow range.

The figure below illustrates the trend: models trained primarily on synthetic data produce less and less diverse outputs as generations progress, and their output distributions become increasingly narrow.
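This narrowing is easy to reproduce in a toy setting. The sketch below is my own illustration, not code from the report: it repeatedly fits a simple Gaussian "model" to a finite sample generated by the previous generation's model. Because each generation only sees what the last one produced, sampling error compounds and the estimated spread tends to shrink, so the tails of the original distribution are forgotten first.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: a broad Gaussian stands in for the true data distribution.
real_data = rng.normal(loc=0.0, scale=1.0, size=100_000)

def fit(samples):
    # "Training": estimate the distribution's mean and spread from a finite sample.
    return samples.mean(), samples.std()

def generate(mean, std, n):
    # "Generation": draw synthetic data from the fitted model.
    return rng.normal(loc=mean, scale=std, size=n)

data = real_data
for gen in range(301):
    mean, std = fit(data)
    if gen % 50 == 0:
        print(f"generation {gen:3d}: mean={mean:+.3f}  std={std:.3f}")
    # Each new generation trains only on a finite synthetic sample
    # from the previous one, so estimation error accumulates.
    data = generate(mean, std, n=100)
```

Run long enough, the printed std drifts toward zero: the model still produces plausible-looking samples, but from an ever narrower slice of the original distribution.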

Another study compared two training loops:

  • Fully synthetic: the model is trained entirely on synthetic data.
  • Synthetic data augmentation: the model is trained on a mixture of synthetic data and real data.

In both cases, the quality of the generated images degrades as training progresses.

However, the synthetic data augmentation loop (with some real data added) degrades more slowly, although both approaches show diminishing returns with further training.
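Continuing the toy Gaussian sketch from above (again, an illustration of the mechanism, not the report's image experiment), the difference between the two loops can be made concrete: mixing a fraction of fresh real data back in at every generation keeps re-anchoring the model to the true distribution, so the collapse is much slower.

```python
import numpy as np

rng = np.random.default_rng(1)

TRUE_MEAN, TRUE_STD = 0.0, 1.0  # the "real" data distribution

def run_loop(real_fraction, generations=300, n=100):
    """Self-consuming loop: each generation refits a Gaussian to its
    training set and generates the next one; `real_fraction` of fresh
    real data is mixed in at every step (0.0 = fully synthetic)."""
    data = rng.normal(TRUE_MEAN, TRUE_STD, size=n)
    for _ in range(generations):
        mean, std = data.mean(), data.std()
        n_real = int(real_fraction * n)
        synthetic = rng.normal(mean, std, size=n - n_real)
        fresh_real = rng.normal(TRUE_MEAN, TRUE_STD, size=n_real)
        data = np.concatenate([synthetic, fresh_real])
    return data.std()  # how much of the original spread survives

print("fully synthetic      -> final std:", round(run_loop(0.0), 3))
print("synthetic + 20% real -> final std:", round(run_loop(0.2), 3))
```

In this toy version the augmented loop keeps its spread close to the real data's, while the fully synthetic loop collapses; the report's image-generation results point in the same direction, though there both regimes still degrade with enough training.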
