Synthetic Data
Training data generated by models rather than collected from humans.
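Synthetic data, such as model-written solutions, self-play transcripts, and distilled answers, is increasingly central to post-training because it lets labs scale up high-quality examples cheaply. Its main risks are poor quality control, where errors in generated examples pass into the training set unfiltered, and model collapse, where models trained repeatedly on model-generated data can degrade over time.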
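
One common way to address the quality-control risk is to keep only generated examples that pass an automatic check. The following is a minimal sketch of that idea, not a description of any particular lab's pipeline. The names `generate_candidates`, `verify`, and `build_synthetic_dataset` are hypothetical, and the "model" is a toy stand-in that proposes answers to addition problems and is sometimes wrong.

```python
import random


def generate_candidates(problem, n, rng):
    """Hypothetical stand-in for a model: proposes n answers to a + b.

    Roughly half of the proposals are off by one, to mimic model errors.
    """
    a, b = problem
    return [a + b + rng.choice([0, 0, 1, -1]) for _ in range(n)]


def verify(problem, answer):
    """Hypothetical checker: accepts an answer only if it is exactly correct."""
    a, b = problem
    return answer == a + b


def build_synthetic_dataset(problems, n_samples, seed=0):
    """Sample candidates per problem, keep verified ones, drop duplicates."""
    rng = random.Random(seed)
    dataset = []
    for problem in problems:
        seen = set()
        for answer in generate_candidates(problem, n_samples, rng):
            # Quality control: discard anything the checker rejects.
            if verify(problem, answer) and answer not in seen:
                seen.add(answer)
                dataset.append({"problem": problem, "answer": answer})
    return dataset


if __name__ == "__main__":
    data = build_synthetic_dataset([(1, 2), (3, 4)], n_samples=8)
    for example in data:
        print(example)
```

In practice the checker might be a unit-test suite, an exact-match grader, or another model. Filtering of this kind targets per-example quality only; it does not by itself guard against model collapse.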