@retrain-pipelines v0.2.0 is out! I'm at Station F at my booth for GOSIM Paris 2026 today & tomorrow. Come meet me for a live in-person demo and a chat!
1️⃣ Build a solid RL env with Verifiers (Prime Intellect)
2️⃣ Generate synthetic data: <200 games sampled from GPT-5-mini playing in the env
3️⃣ SFT warm-up to teach format
4️⃣ Group-based RL (CISPO) against opponents making 20-70% random moves
5️⃣ RL again with stronger opponents (0-25% random moves) + 1.25 temperature to push exploration and shake off suboptimal strategies
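To make the opponent-skill knob concrete, here's a minimal sketch (not the actual retrain-pipelines code; `best_move_fn` is a hypothetical helper, and the stage-1 temperature is an assumption since the post only specifies 1.25 for the second RL stage):

```python
import random

def opponent_move(board, random_move_prob, best_move_fn):
    """Play a random legal move with probability `random_move_prob`,
    otherwise defer to a stronger policy (`best_move_fn`)."""
    legal = [i for i, cell in enumerate(board) if cell == " "]
    if random.random() < random_move_prob:
        return random.choice(legal)
    return best_move_fn(board, legal)

# Hypothetical two-stage schedule matching the recipe above:
RL_STAGES = [
    {"random_move_prob_range": (0.20, 0.70), "sampling_temperature": 1.0},   # stage 4️⃣ (temperature assumed)
    {"random_move_prob_range": (0.00, 0.25), "sampling_temperature": 1.25},  # stage 5️⃣
]
```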
Local Gemma 4 agent 💎🕵️🗺️ Drop in a mysterious map and it finds the location, pulls live weather, and suggests top spots to visit
I've been exploring what google/gemma-4-E4B-it can do in a local agentic setup and put together a 📓 𝙣𝙤𝙩𝙚𝙗𝙤𝙤𝙠 with Gemma + Haystack AI Framework covering 4 demos.
I initially tried to load all tools from the GitHub MCP server, which quickly filled the available context on Colab -> an unusable, forgetful agent ❌
Then I used the 𝗦𝗲𝗮𝗿𝗰𝗵𝗮𝗯𝗹𝗲 𝗧𝗼𝗼𝗹𝘀𝗲𝘁 🔎 🧰 It discovers the right tools from the GitHub MCP server on the fly, loading only what it actually needs for the task at hand and keeping context lean.
Now it actually works.
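For the curious, here's a rough sketch of the idea behind that approach (hypothetical names, not the actual Haystack API): keep a searchable catalog of tool descriptions and only inject the few tools that match the current task into the prompt.

```python
def select_tools(task: str, tool_catalog: dict[str, str], max_tools: int = 3) -> list[str]:
    """Naive keyword match over tool descriptions; a real implementation could
    use embeddings. Returns the names of the few tools worth loading."""
    task_words = set(task.lower().split())
    scored = [
        (len(task_words & set(description.lower().split())), name)
        for name, description in tool_catalog.items()
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:max_tools] if score > 0]

# Example: only issue-related tools get loaded for an issue-triage task,
# keeping the context window lean.
catalog = {
    "search_issues": "search issues in a GitHub repository",
    "create_pull_request": "open a pull request on a branch",
    "list_commits": "list commits on a branch",
}
print(select_tools("find open issues about flaky tests", catalog))
```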
The notebook also contains
💎 Multimodal weather agent: the mystery map demo above
💎 Visual Question Answering from a paper
💎 RAG on Rock music
It all starts with 𝗥𝗲𝗶𝗻𝗳𝗼𝗿𝗰𝗲𝗺𝗲𝗻𝘁 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗩𝗲𝗿𝗶𝗳𝗶𝗮𝗯𝗹𝗲 𝗥𝗲𝘄𝗮𝗿𝗱𝘀
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training
In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s)
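As an illustration (not any specific library's API; the "Answer:" format is an assumption), a verifiable reward for such an env can be as small as this:

```python
import re

def extract_answer(completion: str) -> str | None:
    """Assume the model is asked to end its output with 'Answer: <value>'."""
    match = re.search(r"Answer:\s*(.+)", completion)
    return match.group(1).strip() if match else None

def reward(completion: str, ground_truth: str) -> float:
    """1.0 for a correct final answer, small bonus for respecting the format."""
    answer = extract_answer(completion)
    format_reward = 0.1 if answer is not None else 0.0
    correctness_reward = 1.0 if answer == ground_truth else 0.0
    return correctness_reward + format_reward

# Scoring one rollout against the fixed ground truth
print(reward("Reasoning... Answer: 42", "42"))  # 1.1
```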
Consider a more complex tic-tac-toe env ❌⭕ It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions
(envs can also include tools)
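Here's a minimal sketch of what one env turn could look like (hypothetical structure and reward values, not the Verifiers API): the env owns the board, the opponent's skill is a tunable random-move probability, and the episode spans several model turns.

```python
import random

def winner(board):
    lines = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]
    for a, b, c in lines:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def step(board, model_move, random_move_prob):
    """One multi-turn step: apply the model's move ('X'), then let a
    partly-random opponent reply ('O'). Returns (board, reward, done)."""
    board[model_move] = "X"
    if winner(board) == "X":
        return board, 1.0, True          # model wins
    legal = [i for i, c in enumerate(board) if c == " "]
    if not legal:
        return board, 0.5, True          # draw
    if random.random() < random_move_prob:
        opp = random.choice(legal)       # tunable skill: random move
    else:
        opp = legal[0]                   # placeholder for a stronger opponent policy
    board[opp] = "O"
    if winner(board) == "O":
        return board, 0.0, True          # model loses
    return board, 0.0, False             # game continues: next model turn
```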
---
What happens during training?
We use 𝗚𝗿𝗼𝘂𝗽 𝗥𝗲𝗹𝗮𝘁𝗶𝘃𝗲 𝗣𝗼𝗹𝗶𝗰𝘆 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻 with a tic-tac-toe env
No critic model needed: the group is the baseline. Simpler than PPO.
1️⃣ Rollout generation: from the same board, model plays N games via sampling
2️⃣ Each game scored with deterministic rewards (win, format, ...)
3️⃣ Mean score computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ Model updated to favor trajectories above baseline
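In code, the group-relative advantage boils down to a few lines (illustrative sketch; many GRPO implementations also divide by the group's standard deviation):

```python
def group_advantages(rewards: list[float]) -> list[float]:
    baseline = sum(rewards) / len(rewards)   # the group mean is the baseline, no critic model
    return [r - baseline for r in rewards]

# e.g. 4 games played from the same board: win, loss, draw, win + format bonus
rewards = [1.0, 0.0, 0.5, 1.1]
print(group_advantages(rewards))  # ≈ [0.35, -0.65, -0.15, 0.45]
```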
Earlier this month, Apple introduced Simple Self-Distillation: a fine-tuning method that improves models on coding tasks just by sampling from the model and training on its own outputs with plain cross-entropy.
And… it's already supported in TRL, with the implementation built by Kashif Rasul. You can really feel the pace of development in the team 🐎
Paper by Ruixiang ZHANG, He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang at Apple 🍎
How it works: the model generates completions at a training-time temperature (T_train) with top_k/top_p truncation, then fine-tunes on them with plain cross-entropy. No labels or verifier needed.
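A hedged sketch of that loop with transformers (not Apple's or TRL's actual implementation; the model name and hyperparameter values are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Write a Python function that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt")

# 1) Sample a completion from the model itself at the training-time temperature
with torch.no_grad():
    sampled = model.generate(
        **inputs, do_sample=True, temperature=1.2, top_k=50, top_p=0.95,
        max_new_tokens=128,
    )

# 2) Train on the sampled sequence with plain cross-entropy:
#    labels are the sampled tokens themselves, no ground truth, no verifier
labels = sampled.clone()
labels[:, : inputs["input_ids"].shape[1]] = -100  # only learn on the completion part
loss = model(input_ids=sampled, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```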
One neat insight from the paper: T_train and T_eval compose into an effective T_eff = T_train × T_eval, so a broad band of configs works well. Even very noisy samples still help.