We are thrilled to announce the launch of SKT-OMNI-CORPUS-146T-V1, a massive-scale, high-quality dataset designed to power from-scratch training of the next generation of foundation models (LLMs). Developed at SKT AI LABS, this corpus is not just a collection of data; it's a mission to decentralize high-grade AI training for regional languages and global knowledge.
💎 Key Highlights:
• Massive Scale: A multi-terabyte corpus targeting the 146-trillion-token (146T) mark.
• Pure Quality: Curated from 500+ elite sources.
• Structured for MoE: Sharded into standardized 3.5 GB units (the SKT-𝕻 series) for seamless distributed training.
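As a sketch of what that sharding enables: each data-parallel rank can stream a disjoint slice of the corpus without staging the full dataset locally. The repo id and the `text` column below are illustrative assumptions, not the published schema.

```python
# Minimal sketch, assuming the shards are published as a streamable
# Hugging Face dataset. Repo id and column name are placeholders.
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

# Hypothetical repo id; substitute the actual path once published.
stream = load_dataset(
    "SKT-AI-LABS/SKT-OMNI-CORPUS-146T-V1", split="train", streaming=True
)

# Give each data-parallel rank a disjoint subset of the 3.5 GB shards.
stream = split_dataset_by_node(stream, rank=rank, world_size=world_size)

for example in stream:
    text = example["text"]  # assumed column name
    # ... tokenize and feed the MoE trainer here ...
    break  # remove this in a real training loop
```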
🤝 Open for Collaboration!
We are looking for AI researchers, CUDA engineers, and data scientists to join us in this journey of building Project Surya and the ST-X Series models. Whether it's optimization, custom tokenization, or architecture design—let’s build the future together.
Following the strong response to the Google Code Archive nyuuzyou/google-code-archive (thanks!), this release preserves another major historical repository: the Microsoft CodePlex Archive.
CodePlex served as Microsoft's primary open-source hosting platform from 2006 to 2017. This dataset captures the distinct .NET- and Windows-centric development ecosystem that flourished before the industry standardized on GitHub.
Key Stats:
- 5,043,730 files from 38,087 repositories
- 3.6 GB compressed Parquet
- 91 programming languages (heavily featuring C#, ASP.NET, and C++)
- Cleaned of binaries, build artifacts, and vendor directories (node_modules, packages)
- Includes platform-specific license metadata (Ms-PL, Ms-RL)
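For anyone who wants to poke at the dump, here is a minimal exploration sketch. The repo id (guessed from the google-code-archive naming) and the `language`/`license` column names are assumptions, not a confirmed schema:

```python
# Minimal exploration sketch; repo id and column names are assumptions.
from datasets import load_dataset

ds = load_dataset("nyuuzyou/codeplex-archive", split="train", streaming=True)

# Count Ms-PL-licensed C# files in a small sample of the archive.
csharp_mspl = 0
for i, row in enumerate(ds):
    if i >= 10_000:  # sample only; the full dump holds 5,043,730 files
        break
    if row.get("language") == "C#" and row.get("license") == "Ms-PL":
        csharp_mspl += 1

print(f"Ms-PL-licensed C# files in the sample: {csharp_mspl}")
```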
Do you know what I was planning to do this time last week?
I was preparing to write a report declaring that Jan-nano was a failed project because the benchmark results didn't meet expectations.
But I thought: it can't be. When I loaded the model into the app, the performance clearly felt better. So why were the benchmark results worse?
That's when I reviewed the entire benchmark codebase and realized something fundamental: agentic versus workflow-based evaluation setups introduce a huge gap and a lot of variance in benchmark results. Jan-nano was trained in an agentic setup, so it simply can't be benchmarked fairly with a rigid workflow-based method.
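To make the mismatch concrete, here is an illustrative sketch, not Jan-nano's actual harness; `chat`, `run_tool`, and `extract_answer` stand in for a hypothetical model client, tool executor, and answer parser:

```python
# Illustrative contrast between the two evaluation styles.

def workflow_benchmark(question, chat, extract_answer):
    """Rigid workflow: one fixed prompt, one completion, score it.
    This penalizes a model trained to gather evidence with tools first."""
    reply = chat([{"role": "user", "content": question}])
    return extract_answer(reply)

def agentic_benchmark(question, chat, run_tool, extract_answer, max_turns=8):
    """Agentic loop: the model may call tools (e.g. search) over several
    turns before committing to an answer, matching how it was trained."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = chat(messages)
        if reply.get("tool_call"):  # the model asked to use a tool
            messages.append(reply)
            messages.append(run_tool(reply["tool_call"]))
            continue
        return extract_answer(reply)  # the model gave a final answer
    return ""  # give up after max_turns
```

Scoring only the first completion of a tool-trained model, as the workflow harness does, is exactly the kind of rigid method that underreported Jan-nano's performance.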
I made the necessary changes, and the model ended up performing even better than it had before the issue appeared. It turns out the previous benchmarking method conflicted with the way the model was trained.
What if I had given up? That would’ve meant 1.5 months of training and a huge amount of company resources wasted.
But now this is officially the biggest and most successful release for the whole team, all thanks to Jan-nano.