"2024 Artificial Intelligence Index Report" - 1.1 Model Data Exhaustion Risk
AI

"2024 Artificial Intelligence Index Report" - 1.1 Model Data Exhaustion Risk

April 21, 2024

This chapter differs somewhat from my previous understanding. In an earlier post, I discussed strategies for using lower-version GPT models to train higher-version models.

In the report's section "Will models run out of data?", the concern is raised that language data, whether low-quality or high-quality, and even image data, will eventually be insufficient to support the training of ever-larger models.

To address this challenge, many researchers train one large language model (LLM) on the outputs of another, supplementing real data with synthetic data. However, studies indicate that this approach has significant flaws: models gradually lose their memory of the true underlying data distribution and begin producing outputs drawn from an increasingly narrow range.

The figure below illustrates the trend: models trained primarily on synthetic data produce less and less diverse outputs as generations progress, and their output distributions become increasingly narrow.
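This narrowing is easy to reproduce in a toy setting. The sketch below is my own illustration, not code from the report: it repeatedly fits a simple Gaussian "model" to a finite sample generated by the previous generation's model. Because each generation only sees what the last one produced, sampling error compounds and the estimated spread tends to shrink, so the tails of the original distribution are forgotten first.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: a broad Gaussian stands in for the true data distribution.
real_data = rng.normal(loc=0.0, scale=1.0, size=100_000)

def fit(samples):
    # "Training": estimate the distribution's mean and spread from a finite sample.
    return samples.mean(), samples.std()

def generate(mean, std, n):
    # "Generation": draw synthetic data from the fitted model.
    return rng.normal(loc=mean, scale=std, size=n)

data = real_data
for gen in range(301):
    mean, std = fit(data)
    if gen % 50 == 0:
        print(f"generation {gen:3d}: mean={mean:+.3f}  std={std:.3f}")
    # Each new generation trains only on a finite synthetic sample
    # from the previous one, so estimation error accumulates.
    data = generate(mean, std, n=100)
```

Run long enough, the printed std drifts toward zero: the model still produces plausible-looking samples, but from an ever narrower slice of the original distribution.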

Another study compared two training loops:

  • Fully synthetic: the model is trained entirely on synthetic data.
  • Synthetic data augmentation: the model is trained on a mixture of synthetic data and real data.

In both cases, the quality of the generated images degrades as training progresses.

However, the synthetic data augmentation loop (with some real data added) degrades more slowly, although both approaches show diminishing returns with further training.
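Continuing the toy Gaussian sketch from above (again, an illustration of the mechanism, not the report's image experiment), the difference between the two loops can be made concrete: mixing a fraction of fresh real data back in at every generation keeps re-anchoring the model to the true distribution, so the collapse is much slower.

```python
import numpy as np

rng = np.random.default_rng(1)

TRUE_MEAN, TRUE_STD = 0.0, 1.0  # the "real" data distribution

def run_loop(real_fraction, generations=300, n=100):
    """Self-consuming loop: each generation refits a Gaussian to its
    training set and generates the next one; `real_fraction` of fresh
    real data is mixed in at every step (0.0 = fully synthetic)."""
    data = rng.normal(TRUE_MEAN, TRUE_STD, size=n)
    for _ in range(generations):
        mean, std = data.mean(), data.std()
        n_real = int(real_fraction * n)
        synthetic = rng.normal(mean, std, size=n - n_real)
        fresh_real = rng.normal(TRUE_MEAN, TRUE_STD, size=n_real)
        data = np.concatenate([synthetic, fresh_real])
    return data.std()  # how much of the original spread survives

print("fully synthetic      -> final std:", round(run_loop(0.0), 3))
print("synthetic + 20% real -> final std:", round(run_loop(0.2), 3))
```

In this toy version the augmented loop keeps its spread close to the real data's, while the fully synthetic loop collapses; the report's image-generation results point in the same direction, though there both regimes still degrade with enough training.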
