Do Bubbles Form When Tens of Thousands of AIs Simulate Capitalism?
We gave LLMs autonomous trading over 30 real tickers at 100x leverage. All went bankrupt in 30 minutes from hallucination. This spawned FINAL Bench (first metacognition benchmark) and AI NPC Trading Arena โ tens of thousands of metacognition-equipped AI agents competing under capitalist rules. Humans can only watch.
NPCs form a society: 3-tier memory, self-modifying parameters, mutual criticism, strategy propagation, and a virtual SEC enforcing fines every 20 minutes. Every trade passes 4-stage verification including Brave Search fact-check. FINAL Bench confirmed across 9 SOTA models that AI can say "I might be wrong" (MA 0.694) but cannot actually fix errors (ER 0.302).
Six findings: Bubbles form naturally through knowledge transfer and swarm herding. Identical NPCs diverge irreversibly from their first three trades. Metacognition blocks individual hallucination but not collective herding โ this is the key finding. Information asymmetry solidifies hierarchy. Fraud and regulation co-evolve. Criticism improves returns.
Individual intelligence does not guarantee collective intelligence.
@abdurrahmanbutler and I just dropped Legal RAG Bench, the first benchmark for legal RAG systems to simultaneously evaluate hallucinations, retrieval failures, and reasoning errors.
Our key takeaways are: 1. Embedding models, not generative models, are the primary driver of RAG accuracy. Switching from a general-purpose embedder like OpenAI's Text Embedding 3 Large to a legal domain embedder like Isaacus' Kanon 2 Embedder can raise accuracy by ~19 points. 2. Hallucinations are often triggered by retrieval failures. Fix your retrieval stack, and, in most cases, you end up fixing hallucinations. 3. Once you have a solid legal retrieval engine like Kanon 2 Embedder, it doesnโt matter as much what generative model you use; GPT-5.2 and Gemini 3.1 Pro perform relatively similarly, with Gemini 3.1 Pro achieving slightly better accuracy at the cost of more hallucinations. 4. Google's latest LLM, Gemini 3.1 Pro, is actually a bit worse than its predecessor at legal RAG, achieving 79.3% accuracy instead of 80.3%.
These findings confirm what we already knew at Isaacus: that information retrieval sets the ceiling on the accuracy of legal RAG systems. It doesnโt matter how smart you are; you arenโt going to magically know what the penalty is for speeding in California without access to an up-to-date copy of the California Vehicle Code.
Even still, to our knowledge, weโre the first to actually show this empirically.
Unfortunately, as we highlight in our write-up, high-quality open legal benchmarks like Legal RAG Bench and our earlier MLEB are few and far between.
In the interests of transparency, we have not only detailed exactly how we built Legal RAG Bench, but weโve also released all of our data openly on Hugging Face. You can read our write up [here](https://isaacus.com/blog/legal-rag-bench), noting that weโll soon be publishing it as a paper.
Kudos to my brother @abdurrahmanbutler for serving as the lead author on this monumental release.
@CohereLabs just released ๐ฟ Tiny Aya: a fully open-source 3B parameter model that speaks 70+ languages ๐! But thereโs a catch:
Tiny Aya is just a language model. It doesnโt support tool calling, the key capability that turns frontier models into powerful *agents*. So the real question is:
How hard is it to turn Tiny Aya into an agent?
Turns outโฆ itโs simple, thanks to Hugging Face TRL. Weโre sharing a hands-on example showing how to train Tiny Aya to turn it into a tool-calling agent using TRL, unlocking what could become the first *massively multilingual open agent*.
I wasted days on a GPU node on a bug that shouldn't exist
So I was fine-tuning TildeOPEN-30B and the outputs were... weird. Token ID 179 (<0x00>) kept appearing between almost every token pair. Took me a bit to figure out what was going on.
Turns out I used the fast tokenizer for training, but the model was trained on the slow one. Silent failure.
Well... long story shortโTGI uses (forces) the fast tokenizer, no questions asked. And you'll have agile's kryptonite: silent failure. If the model was trained on slow, it's a silent disaster.
I got curious and wrote a quick script to check how common this is. Ran it on 6,014 LLM HF models overnight.
Roughly 10% of HF model downloads have mismatched tokenizers. Not all mismatches are catastrophic, but some are brutal โ like chat template markers inflating from 1 token to 3, silently wrecking context windows and causing model act weird.
This wasn't rigorous research, but the drift is real. And the worst part? 968 models(out of 500+ downloads) have both fast and slow tokenizers present, but they still produce different outputs. No missing files, no errors โ just silent degradation.
TGI defaults to the fast tokenizer, as does AutoTokenizer.from_pretrained(). If a fast tokenizer doesn't exist, it auto-generates one. If your model was trained on slow, you get silent degradation. Output looks fine; the model just performs worse. Sometimes really worse. You'd never know.
If model was trained on fast tokenizer, its fine, but how do You know?
The root cause? Either model authors run HF conversion and upload both without verifying, or users run TGI, which always forces(converts to) fast .
It's based on TildeOPEN-30B (a solid EU HPC multilingual base). Nothing fancyโjust a proper instruction fine-tune where I didn't mess up the tokenizer this time.
The LLM by @karpathy is officially in the library, and we wrote a blog covering: how did we port the model, differences from the original, and how to run or train it.
๐ AutoXLA - Accelerating Large Models on TPU AutoXLA is an experimental library that automates the distribution, optimization, and quantization of large language models for TPUs using PyTorch/XLA. It extends the Hugging Face Transformers interface with TPU-aware features such as automatic sharding, custom attention kernels, and quantization-aware loading, making large-scale deployment and training both simpler and faster. With quantization and Splash Attention kernels, AutoXLA achieves up to 4ร speedups over standard Flash Attention implementations, significantly improving throughput for both inference and training workloads. Whether youโre experimenting with distributed setups (FSDP, 2D, or 3D sharding) or optimizing memory via LanguageModelQuantizer, AutoXLA is built to make scaling LLMs on TPU seamless. โ ๏ธ Note: This is an experimental repository. Expect rough edges! Please report bugs or unexpected behavior through GitHub issues. ๐ GitHub Repository: https://github.com/Locutusque/AutoXLA
These samples were created using reservoir sampling - an algorithm that guarantees statistically unbiased random samples from massive source datasets. This means results you get at the 1B token scale are representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational overhead.
The collection includes: - finePDFs-1B: High-quality textbook-style educational content - DCLM-baseline-1B: Filtered, diverse web content - FineWeb-Edu-1B: Curated educational web resources
We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.
Whether you're researching optimal data mixtures, testing curriculum learning strategies, or just want to quickly prototype a pretraining run, these samples give you a solid foundation to start experimenting immediately.
There is no anxiety quite like powering up 2KW of basement compute after rewiring it all. Small bit of trouble with the horizontal 3090 because I misread my motherboard manual, but otherwise so far so good.. Next we see if I've built up enough cooling to hit my target TDP on those 3-slot nvlinked cards especially. The 4-slot bridges are much easier to work with but their prices went bananas and I couldn't acquire a second, so gotta get a little creative with intakes.
Some weeks ago, i've just decide its time to leave LinkedIn for me. It got silent around my open source activities the last year, so i thought something has to change.
That's why my focus will move to share experiences and insights about hardware, drivers, kernels and linux. I won't post about how to use models, built agents or do prompting. I want to share about some deeper layers the actual hypes are built on.
I will start posting summarizations of my articles here on the hub. English version: https://flozi.net/en
After training ๐๐ฆ๐จ๐ฅ๐๐๐ on ๐๐๐ ๐๐๐๐๐ฌ for nearly a month, I've come to realize something most people overlook: ๐ข๐ง๐๐ซ๐๐ฌ๐ญ๐ซ๐ฎ๐๐ญ๐ฎ๐ซ๐ ๐ข๐ฌ ๐ญ๐ก๐ ๐ฆ๐๐ค๐-๐จ๐ซ-๐๐ซ๐๐๐ค ๐๐๐๐ญ๐จ๐ซ ๐ข๐ง ๐๐๐ ๐ญ๐ซ๐๐ข๐ง๐ข๐ง๐ . ๐ฅ
Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious ๐๐๐๐ ๐๐ซ๐ซ๐จ๐ซ๐ฌ, or when your expensive GPU cluster is running at ๐๐% ๐๐๐๐ข๐๐ข๐๐ง๐๐ฒ, the problem isn't your model. It's most probably a ๐ฆ๐ข๐ฌ๐ฎ๐ฌ๐ ๐จ๐ ๐ญ๐ก๐ ๐ก๐๐ซ๐๐ฐ๐๐ซ๐. ๐ ๏ธ
Questions that seemed simple but had no clear answers: Why is ๐๐จ๐ ๐ญ๐ซ๐๐ข๐ง๐ข๐ง๐ ๐ฌ๐ฅ๐จ๐ฐ๐๐ซ ๐ญ๐ก๐๐ง ๐๐๐ง๐ฌ๐ ๐ฆ๐จ๐๐๐ฅ๐ฌ? Which ๐๐๐๐ ๐๐ฅ๐๐ ๐ฌ should we actually set? How often should we checkpoint without killing throughput?
That's why we built ๐๐ก๐ ๐๐ฆ๐จ๐ฅ ๐๐ซ๐๐ข๐ง๐ข๐ง๐ ๐๐ฅ๐๐ฒ๐๐จ๐จ๐ค ๐: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the ๐ข๐ง๐๐ซ๐๐ฌ๐ญ๐ซ๐ฎ๐๐ญ๐ฎ๐ซ๐ ๐ฅ๐๐ฒ๐๐ซ that most teams get wrong.
We validated real vs theoretical bandwidth across the entire stack: ๐๐๐๐ ๐ก๐ข๐ญ๐ญ๐ข๐ง๐ ๐ ๐๐/๐ฌ, ๐๐๐๐ข๐ง๐ค ๐.๐ ๐ซ๐๐๐๐ก๐ข๐ง๐ ๐๐๐ ๐๐/๐ฌ, ๐๐๐๐ ๐๐๐ง๐ ๐๐ญ ๐๐.๐ ๐๐/๐ฌ. Then we ran collective operations across ๐๐๐ ๐๐๐๐ฌ (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from ๐๐๐ ๐๐/๐ฌ on a single node to ๐๐๐-๐๐๐ ๐๐/๐ฌ across 16 nodes.
If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.
Iโm just reading that Ryzen AI 395 has to be 30% slower than DGX Spark in LLM inferencingโฆ and only 96GB GPU RAMโฆ good I havenโt RTFM upfront, so I made the AMD faster with 128GB unified RAM ๐ซก Z2 mini G1a can run Qwen3 Coder 30B BF16 at 26.8 tok/sec in ~60GB GPU RAM
๐งฌ Breaking news in Clinical AI: Introducing the OpenMed NER Model Discovery App on Hugging Face ๐ฌ
OpenMed is back! ๐ฅ Finding the right biomedical NER model just became as precise as a PCR assay!
I'm thrilled to unveil my comprehensive OpenMed Named Entity Recognition Model Discovery App that puts 384 specialized biomedical AI models at your fingertips.
๐ฏ Why This Matters in Healthcare AI: Traditional clinical text mining required hours of manual model evaluation. My Discovery App instantly connects researchers, clinicians, and data scientists with the exact NER models they need for their biomedical entity extraction tasks.
๐ฌ What You Can Discover: โ Pharmacological Models - Extract "chemical compounds", "drug interactions", and "pharmaceutical" entities from clinical notes โ Genomics & Proteomics - Identify "DNA sequences", "RNA transcripts", "gene variants", "protein complexes", and "cell lines" โ Pathology & Disease Detection - Recognize "pathological formations", "cancer types", and "disease entities" in medical literature โ Anatomical Recognition - Map "anatomical systems", "tissue types", "organ structures", and "cellular components" โ Clinical Entity Extraction - Detect "organism species", "amino acids", 'protein families", and "multi-tissue structures"
๐ก Advanced Features: ๐ Intelligent Entity Search - Find models by specific biomedical entities (e.g., "Show me models detecting CHEM + DNA + Protein") ๐ฅ Domain-Specific Filtering - Browse by Oncology, Pharmacology, Genomics, Pathology, Hematology, and more ๐ Model Architecture Insights - Compare BERT, RoBERTa, and DeBERTa implementations โก Real-Time Search - Auto-filtering as you type, no search buttons needed ๐จ Clinical-Grade UI - Beautiful, intuitive interface designed for medical professionals
Ready to revolutionize your biomedical NLP pipeline?
๐ Try it now: OpenMed/openmed-ner-models ๐งฌ Built with: Gradio, Transformers, Advanced Entity Mapping
Has anyone ever backed up a model to a sequential tape drive, or I'm the world first? :D Just played around with my retro PC that has got a tape driveโdid it just because I can.
๐ข๐พ Introducing the Common Crawl Creative Commons Corpus (C5)!
C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, to only documents that are Creative Commons-licensed such as cc-by-4.0 or public domain cc0. At this stage 150 billion tokens have been collected.
</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks as well as in metadata. Additional data fields are included such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.
๐ In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frysian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was not filtered on quality nor deduplicated so that you can decide for yourself how much data to keep. To give some quality indication, a dataset field is present to describe whether a document is included in the FineWeb(-2) datasets, which are of high quality.
๐ More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer but more is needed. If you have the compute available and which to collaborate in an open and transparent manner, please get in touch!
xet-team we've been hard at work bringing a new generation of storage to the Hugging Face community, and weโve crossed some major milestones:
๐ท Over 2,000 builders and nearing 100 organizations with access to Xet ๐ Over 70,000 model and dataset repositories are Xet-backed ๐คฏ 1.4 petabytes managed by Xet
As we move repos from LFS to Xet for everyone we onboard, weโre pushing our content-addressed store (CAS). Check out the chart below ๐ of CAS hitting up to 150 Gb/s throughput this past week.
All of this growth is helping us build richer insights. We expanded our repo graph, which maps how Xet-backed repositories on the Hub share bytes with each other.
Check out the current network in the image below (nodes are repositories, edges are where repos share bytes) and visit the space to see how different versions of Qwen, Llama, and Phi models are grouped together xet-team/repo-graph
I am fascinated by models learning from prompts and rewards - no example answers needed like in Supervised Fine-Tuning.
After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game...
I wanted a different challenge, like ๐๐ฒ๐ฎ๐ฐ๐ต๐ถ๐ป๐ด ๐ฎ ๐บ๐ผ๐ฑ๐ฒ๐น ๐๐ผ ๐ฐ๐ฟ๐ฒ๐ฎ๐๐ฒ ๐ฎ ๐๐ฐ๐ต๐ฒ๐ฑ๐๐น๐ฒ ๐ณ๐ฟ๐ผ๐บ ๐ฎ ๐น๐ถ๐๐ ๐ผ๐ณ ๐ฒ๐๐ฒ๐ป๐๐ ๐ฎ๐ป๐ฑ ๐ฝ๐ฟ๐ถ๐ผ๐ฟ๐ถ๐๐ถ๐ฒ๐.
Choosing an original problem forced me to: ๐ค Think about the problem setting ๐งฌ Generate data ๐ค Choose the right base model ๐ Design reward functions (and experiencing reward hacking) ๐ Run multiple rounds of training, hoping that my model would learn something.
๐ We are delighted to announce MamayLM, a new state-of-the-art efficient Ukrainian LLM!
๐ MamayLM surpasses similar-sized models in both English and Ukrainian, while matching or overtaking up to 10x larger models.
๐ MamayLM is a 9B model that can run on a single GPU, enabling cost-efficient AI autonomy and adoption across sectors in Ukraine such as education, legal, healthcare, public services and others (e.g., by specializing it to particular use cases). MalayLM is also attractive for organizations wishing to preserve data privacy as it s efficiency allows it to run on a local machine.
๐ง MamayLM is trained on high-quality Ukrainian data and understands Ukrainian language, culture, and history. It is built on top of Googleโs Gemma 2 9B model, but uses a number of new advances stemming from INSAITโs experience in creating BgGPT, a Bulgarian LLM we released last year, now adopted nationwide and profiled several times by Google as a worldwide success case.
๐ค MamayLM is developed in a collaboration between researchers at INSAIT and ETH Zรผrich and is trained entirely via donations to INSAIT for AI compute resources.
๐ฅ MamayLM is now freely available to download on INSAITโs HuggingFace in both full and quantized versions. We also publicly release all Ukrainian benchmarks we evaluated on.
๐ Further, we release blog posts in both English and Ukrainian, sharing our approach to creating MamayLM, hoping to drive further improvements by the community.
๐ The release of LLMs for various languages is part of INSAITโs mission in ensuring countries can achieve AI autonomy in a cost-efficient, controlled, safe and predictable manner.