We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope it provides insights into generating synthetic data at scale for pre-training. https://huggingface.co/blog/cosmopedia
Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates.
🌍 You can leverage various resources for diversity: different seed data, generation formats, and target audiences (see the sketch after this list).
⚙️ A good technical stack matters: tools like llm-swarm for scalable generation, plus fast model training and evaluation.
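To make the first two takeaways concrete, here is a minimal, hypothetical Python sketch of fanning one seed extract out into several distinct prompts by varying the target audience and generation format. The seed texts, audience/format lists, and template wording below are illustrative assumptions, not the actual prompts used for Cosmopedia:

```python
import random

# Illustrative seed extracts (stand-ins for web or curated source data).
SEED_TEXTS = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "Gradient descent minimizes a loss function by iterative updates.",
]

# Varying the audience and format multiplies diversity from the same seed.
AUDIENCES = ["young children", "high school students", "researchers"]
FORMATS = ["textbook chapter", "blog post", "short story"]

# Hypothetical template; the real Cosmopedia prompts differ.
PROMPT_TEMPLATE = (
    "Here is an extract from a web page: {seed}\n\n"
    "Write a {fmt} related to the extract, aimed at {audience}."
)


def build_prompts(seeds, audiences, formats, n_per_seed=2, rng=None):
    """Sample (audience, format) pairs per seed to diversify generations."""
    rng = rng or random.Random(0)
    prompts = []
    for seed in seeds:
        for _ in range(n_per_seed):
            prompts.append(
                PROMPT_TEMPLATE.format(
                    seed=seed,
                    fmt=rng.choice(formats),
                    audience=rng.choice(audiences),
                )
            )
    return prompts


if __name__ == "__main__":
    for p in build_prompts(SEED_TEXTS, AUDIENCES, FORMATS):
        print(p, end="\n---\n")
```

In practice you would feed prompts like these to an inference cluster (e.g. via llm-swarm) and deduplicate both topics and outputs; the blog post covers that pipeline in detail.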