We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope it provides insights into generating synthetic data at scale for pre-training. https://huggingface.co/blog/cosmopedia
Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates.
🌍 You can leverage various resources for diversity: different seed data, generation formats, and target audiences (see the sketch after this list).
⚙️ A good technical stack matters: tools like llm-swarm for scalable generation, plus fast model training and evaluation.
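To make the first two takeaways concrete, here is a minimal, hypothetical Python sketch of fanning one seed extract out into several distinct prompts by varying the target audience and generation format. The seed texts, audience/format lists, and template wording below are illustrative assumptions, not the actual prompts used for Cosmopedia:

```python
import random

# Illustrative seed extracts (stand-ins for web or curated source data).
SEED_TEXTS = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "Gradient descent minimizes a loss function by iterative updates.",
]

# Varying the audience and format multiplies diversity from the same seed.
AUDIENCES = ["young children", "high school students", "researchers"]
FORMATS = ["textbook chapter", "blog post", "short story"]

# Hypothetical template; the real Cosmopedia prompts differ.
PROMPT_TEMPLATE = (
    "Here is an extract from a web page: {seed}\n\n"
    "Write a {fmt} related to the extract, aimed at {audience}."
)


def build_prompts(seeds, audiences, formats, n_per_seed=2, rng=None):
    """Sample (audience, format) pairs per seed to diversify generations."""
    rng = rng or random.Random(0)
    prompts = []
    for seed in seeds:
        for _ in range(n_per_seed):
            prompts.append(
                PROMPT_TEMPLATE.format(
                    seed=seed,
                    fmt=rng.choice(formats),
                    audience=rng.choice(audiences),
                )
            )
    return prompts


if __name__ == "__main__":
    for p in build_prompts(SEED_TEXTS, AUDIENCES, FORMATS):
        print(p, end="\n---\n")
```

In practice you would feed prompts like these to an inference cluster (e.g. via llm-swarm) and deduplicate both topics and outputs; the blog post covers that pipeline in detail.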