Apriel-1.6-15b-Thinker: Cost-efficient Frontier Multimodal Performance
We release Apriel-1.6-15b-Thinker, a 15-billion-parameter multimodal reasoning model in ServiceNow’s Apriel SLM series that achieves SOTA performance against models 10 times its size. Apriel-1.6 builds on Apriel-1.5-15b-Thinker with an extensive focus on strengthening text and vision reasoning while improving token efficiency. This version was trained on NVIDIA DGX™ Cloud with GB200 Grace™ Blackwell Superchips.
Apriel-1.6 scores 57 on the Artificial Analysis Index, outperforming models like Gemini 2.5 Flash, Claude Haiku 4.5, and GPT OSS 20b. It obtains a score on par with Qwen3 235B A22B while being significantly more efficient. This release improves or maintains task performance relative to the previous Apriel-1.5-15B-Thinker [1] while reducing reasoning token usage by more than 30%.
Mid-Training
We follow the same overall training process used for Apriel-1.5-15B-Thinker, which includes a depth-upscaling phase followed by two Continual Pretraining (CPT) stages (detailed in [1]). The depth-upscaling corpus consists of 35% data from diverse sources, including high-quality web content, scientific and technical literature, mathematical problem sets, and programming code; 15% high-quality datasets from NVIDIA Nemotron™; and the remaining 50% pretraining-style data serving as replay.
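For illustration, below is a minimal sketch of how such a fixed-ratio mixture can be sampled. The source names and loader are hypothetical; only the 35/15/50 weights come from the description above.

```python
import random

# Hypothetical source names; only the 35/15/50 weights come from the text above.
MIXTURE = {
    "curated_diverse": 0.35,       # web, scientific/technical, math, code
    "nemotron": 0.15,              # NVIDIA Nemotron datasets
    "pretraining_replay": 0.50,    # pretraining-style replay data
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document according to the mixture."""
    names = list(MIXTURE)
    return rng.choices(names, weights=[MIXTURE[n] for n in names], k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(100_000):
    counts[sample_source(rng)] += 1
print(counts)  # approximately 35k / 15k / 50k draws
```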
For Apriel-1.6-15B-Thinker, we expand the Stage-1 CPT mixture, which focuses on strengthening textual reasoning and image understanding, with additional text-only samples and image-text pairs. The new text data is fully synthetic, covering general reasoning, knowledge, coding, and creative writing, while the multimodal portion spans document and chart understanding, OCR, visual-reasoning tasks, and SVG/web-code synthesis.
Following Stage-1, we perform a text-only CPT run at an extended 49K sequence length and then run Stage 2 to further refine the model’s visual-reasoning capabilities. This combination produced a strong base model that provided a solid foundation for subsequent post-training. Training for this mid-training pipeline required approximately 10,000 GPU hours on NVIDIA's GB200s, a small compute footprint enabled by their high throughput and aligned with our goal of building strong models with limited resources through careful data strategy and training methodology.
Post-Training
Using the mid-trained model, we perform post-training with a pipeline consisting of large-scale Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) targeting both vision and text abilities.
Supervised Finetuning (SFT)
Our Supervised Fine-Tuning (SFT) stage focuses on improving the reasoning quality of Apriel-1.6 by training on a meticulously curated dataset of 2.4 million high-signal text samples. Each example includes explicit, step-by-step reasoning traces, enabling the model to internalize transparent reasoning processes rather than merely reproducing final answers.
To construct this dataset, we combined execution-verifiable synthetic samples for math, coding, and scientific problem-solving with a broad mix of instruction-following, conversational, API/function-calling, creative writing, safety, and other knowledge-intensive samples. Data quality was treated as a first-class priority: every sample passed through multi-stage de-duplication, content filtering, heuristic quality pruning, LLM-as-Judge validation, execution-based verification (where applicable), and strict decontamination against evaluation benchmarks.
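As a rough sketch of what such a filtering funnel can look like (the blocklist, word threshold, and decontamination set below are hypothetical placeholders; the LLM-as-Judge and execution stages are only marked where they would slot in):

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass(frozen=True)
class Sample:
    prompt: str
    response: str  # includes the step-by-step reasoning trace

# Hypothetical placeholders: real pipelines use much richer filters.
BLOCKLIST = {"unsafe_marker"}
BENCHMARK_PROMPTS = {"example benchmark prompt"}
MIN_RESPONSE_WORDS = 20

def curate(samples: Iterable[Sample]) -> List[Sample]:
    """Drop any sample that fails a stage of the funnel."""
    seen, kept = set(), []
    for s in samples:
        if s.prompt in seen:                                # de-duplication (exact match
            continue                                        # here; fuzzy/minhash in practice)
        seen.add(s.prompt)
        if any(term in s.response for term in BLOCKLIST):   # content filtering
            continue
        if len(s.response.split()) < MIN_RESPONSE_WORDS:    # heuristic quality pruning
            continue
        if s.prompt in BENCHMARK_PROMPTS:                   # decontamination vs. eval sets
            continue
        # LLM-as-Judge validation and execution-based verification would slot
        # in here; both call external systems, so they are omitted from this sketch.
        kept.append(s)
    return kept
```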
SFT was carried out in two phases, both trained at a 32K context length. In the first phase, we performed a large-scale text-only run on the 2.4M samples for 4 epochs. Compared to Apriel-1.5-15B-Thinker, we simplified the chat template by removing redundant tags and introduced four special tokens to the tokenizer (<tool_calls>, </tool_calls>, [BEGIN FINAL RESPONSE], <|end|>) for easier output parsing.
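For example, with the Hugging Face transformers API, registering these four tokens and parsing the final answer out of a generation might look like the sketch below; the checkpoint id is a placeholder.

```python
from transformers import AutoTokenizer

# Placeholder checkpoint id; substitute the released Apriel-1.6 model id.
tokenizer = AutoTokenizer.from_pretrained("ServiceNow-AI/Apriel-1.6-15b-Thinker")

# Register the four parsing tokens so the tokenizer never splits them.
special = ["<tool_calls>", "</tool_calls>", "[BEGIN FINAL RESPONSE]", "<|end|>"]
tokenizer.add_special_tokens({"additional_special_tokens": special})

def extract_final_response(generation: str) -> str:
    """Return the text between [BEGIN FINAL RESPONSE] and <|end|>."""
    marker, end = "[BEGIN FINAL RESPONSE]", "<|end|>"
    if marker in generation:
        tail = generation.split(marker, 1)[1]
        return tail.split(end, 1)[0].strip()
    return generation.strip()
```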
The second phase was a lightweight multimodal run trained for 3 epochs, using rejection-sampled data from Apriel-1.5-15B-Thinker to ensure the model maintained strong performance on image inputs after the introduction of these special tokens, while also preparing it for downstream RL stages.
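A minimal sketch of such a rejection-sampling loop, assuming hypothetical generate and verify helpers and an illustrative candidate count:

```python
from typing import Callable, Optional

def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical: one sample from the teacher model
    is_correct: Callable[[str, str], bool],  # hypothetical verifier (answer check, judge, ...)
    num_candidates: int = 8,                 # illustrative, not the value used in practice
) -> Optional[str]:
    """Keep the first sampled response that passes verification."""
    for _ in range(num_candidates):
        candidate = generate(prompt)
        if is_correct(prompt, candidate):
            return candidate
    return None  # prompts with no passing candidate are dropped from the SFT set
```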
This approach provided us with a robust, high-quality SFT foundation on top of which our RL pipeline could operate effectively. The resulting model exhibits strong multimodal understanding, improved text reasoning capabilities, and enhanced agentic behavior.
Reinforcement Learning (RL)
We adopt a multi-stage RL setup focused on simultaneously improving reasoning capability and efficiency. We train the model on image domains such as visual reasoning, general visual question answering (VQA), and optical character recognition (OCR). Our training data also spans text domains such as simple questions (to encourage short, direct answers on easy queries), math (numerical reasoning), STEM (multiple-choice scientific questions), and function calling (structured tool use).
Rewards are given for response correctness, along with penalties for undesirable behaviour such as verbosity and incorrect formats. Overall, our setup is designed to improve the model’s reasoning ability while using fewer reasoning tokens, encouraging it to avoid unnecessary intermediate steps, stop earlier when confident, and answer more directly for simpler queries.
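As a hedged illustration, a shaped reward of this kind could look like the sketch below; the weights and token budget are invented for the example and are not the values used in training.

```python
def compute_reward(
    is_correct: bool,
    well_formatted: bool,
    num_reasoning_tokens: int,
    token_budget: int = 4096,  # hypothetical budget; tuned per domain in practice
) -> float:
    """Correctness reward with penalties for bad formatting and verbosity."""
    reward = 1.0 if is_correct else 0.0
    if not well_formatted:
        reward -= 0.5                               # format-violation penalty
    overage = max(0, num_reasoning_tokens - token_budget)
    reward -= 0.1 * overage / token_budget          # soft penalty past the budget
    return reward
```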
Training is done with the Group Sequence Policy Optimization (GSPO) loss [2] using the VeRL framework and rule-based verification.
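For reference, here is a minimal PyTorch sketch of the sequence-level GSPO objective from [2] for one group of responses to the same prompt. Tensor shapes and the clip width are illustrative, and the production setup runs inside VeRL rather than as a standalone loss.

```python
import torch

def gspo_loss(
    logp_new: torch.Tensor,   # (G, T) per-token log-probs under the current policy
    logp_old: torch.Tensor,   # (G, T) per-token log-probs under the behavior policy
    mask: torch.Tensor,       # (G, T) 1 for response tokens, 0 for padding
    rewards: torch.Tensor,    # (G,)  scalar reward per response in the group
    eps: float = 3e-4,        # illustrative clip width; GSPO clips far tighter than PPO
) -> torch.Tensor:
    """Minimal GSPO objective for one group of G responses to one prompt."""
    # Group-relative advantages, as in GRPO/GSPO.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Sequence-level importance ratio: geometric mean of per-token ratios,
    # i.e. exp of the length-normalized log-prob difference.
    lengths = mask.sum(dim=-1)
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    ratio = log_ratio.exp()
    clipped = ratio.clamp(1 - eps, 1 + eps)
    # PPO-style pessimistic objective, applied per sequence rather than per token.
    return -torch.min(ratio * adv, clipped * adv).mean()
```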
Evaluation
Text Evaluation
We evaluate Apriel-1.6 on various domains such as tool use, math, coding, instruction following and long context.
Text benchmarks included in the Artificial Analysis Index v3.0 use scores reported by Artificial Analysis. All other benchmarks were evaluated internally.
| Category | Benchmark | Apriel-1.6-15B-Thinker | Apriel-1.5-15B-Thinker | GPT OSS 120B | DeepSeek R1 0528 | Gemini 2.5 Flash (Sep) | GPT 5 mini (high) | Claude 4.5 Sonnet (thinking) | o3-mini (high) |
|---|---|---|---|---|---|---|---|---|---|
| | Average Score** | 53.22 | 46.56 | 52.56 | 51.92 | 50.71 | 62.58 | 60.37 | 48.85 |
| Function Calling | BFCL v3 only | 63.50 | 51.88 | 50.62 | 39.75 | 39.75 | 17.62 | - | 50 |
| Function Calling | Tau2 bench Telecom | 69 | 57.8 | 66 | 37 | 32 | 68 | 50.8 | 31 |
| Function Calling | Tau2 bench Retail | 66.67 | 46.78 | 61.4 | 59.94 | 61.69 | 73.39 | 69.8 | 75.73 |
| Function Calling | Tau2 bench Airline | 58 | 52 | 45.3 | 47.33 | 56.66 | 59.33 | 58 | 61.33 |
| Function Calling | ComplexFuncBench | 33.2 | 19 | 24.6 | 24.2 | 26.3 | 37.5 | 24.6 | 18.9 |
| Instruction Following | Agent IF | 57.2 | 55 | 54.20 | 52.20 | 49.70 | 57.60 | 54.50 | 54.90 |
| Instruction Following | Multi IF | 83.34 | 76.91 | 82.95 | 73.76 | 82.49 | 85.37 | 84.32 | 87.28 |
| Instruction Following | Multi-Challenge | 46.15 | 41.39 | 46.90 | 44.50 | 49.08 | 57.90 | 42.49 | 38.46 |
| Instruction Following | IF Bench | 69 | 62 | 69 | 40 | 50 | 75 | 57 | 70.07 |
| Math | AIME 25 | 88 | 88 | 93 | 76 | 73 | 91 | 88 | 86.67 |
| Coding | Struct Eval | 79 | 48.50 | 71 | 73 | 70 | 69.92 | 76 | 73 |
| Coding | LCB | 81 | 73 | 88 | 77 | 70 | 84 | 71 | 73 |
| Coding | SciCode | 37 | 35 | 39 | 40 | 41 | 39 | 45 | 40 |
| Agentic | DeepresearchBench | 36.47 | 32.73 | 36.30 | 34.19 | 38.15 | - | - | 33.40 |
| Agentic | GAIA | 40 | 30.91 | 21.21 | 32.12 | 47.88 | 65.45 | 69.09 | 23.03 |
| Agentic | Work-Arena L1 | 50.2 | 51.5 | 50.9 | 63.9 | 51.8 | 65.5 | 62.7 | 52.4 |
| Agentic | OS World Small | 16.70 | 13.90 | 16.70 | 25 | 19.40 | 22.20 | 30.60 | 19.40 |
| Agentic | SWE Bench Verified | 23 | 16 | 31 | 29.60 | 34.20 | 61 | 64.2 | 22.60 |
| Agentic | Terminal Bench | 14 | 10 | 22 | 15 | 13 | 31 | 33 | 5.67 |
| Agentic | Aider Polyglot | 37.68 | 26.37 | 42 | 71.40 | 40 | 71.60 | 78 | 60.40 |
| Knowledge | MMLU Pro | 79 | 77 | 81 | 85 | 83 | 84 | 88 | 80 |
| Creative Writing | Creative writing v3 / EQ Bench | 59.73 | 60.24 | 53.70 | 79.40 | 74.25 | 75.25 | 80.70 | 30.40 |
| Others | GPQA Diamond | 73 | 71 | 78 | 81 | 79 | 83 | 83 | 77 |
| Others | HLE | 10 | 12 | 18.5 | 14.9 | 11.1 | 19.7 | 17.3 | 12.3 |
| Long Context | AA LCR | 50* | 20 | 51 | 55 | 62 | 68 | 66 | 30*** |
* This score is with DCA enabled. Without this, the model scores 36.
** The average score is calculated using all benchmarks except BFCL v3 Only and DeepResearchBench, since some models do not have scores for these two benchmarks.
*** AA LCR score for o3-mini-high is projected score based on its AA Index score.
Image Evaluation
We evaluate Apriel-1.6 on a representative set of benchmarks with a primary focus on mathematical reasoning, visual question answering, logical reasoning, STEM-related tasks, and chart-based reasoning. All evaluations are run with VLMEvalKit. Apriel-1.6 improves on its predecessor by 4 points on the average of the 13 benchmarks comprising the Image Index: MathVision, MathVista, MMMU (validation), MMMU-Pro (10-choice CoT), MMMU-Pro (vision-only CoT), MathVerse (Vision Dominant), MathVerse (Text Dominant), MMStar, BLINK, LogicVista, CharXiv (descriptive), CharXiv (reasoning), and AI2D (test).
Cost-Efficient Frontier Performance
Apriel-1.6-15B-Thinker sits in the sweet spot of the cost-efficient frontier. It delivers intelligence scores that rival or surpass much larger models while using only 15B parameters. On the chart, it’s firmly inside the most attractive quadrant, balancing efficiency with top-tier reasoning. In practice, this means Apriel-1.6-15B-Thinker offers strong performance and deep reasoning at a fraction of the compute and deployment cost of heavyweight competitors, making it an exceptionally efficient choice for real-world use, especially in enterprise applications.
Our post-training focuses heavily on improving reasoning-token efficiency. The chart above, plotting intelligence score against token usage, highlights the effectiveness of this focus: Apriel-1.6-15B-Thinker again lands in the most attractive quadrant, reaching a high Artificial Analysis Intelligence Index score while using far fewer tokens than many similarly capable or larger models. Compared to Apriel-1.5-15B-Thinker [1], we reduce token usage by over 30%.
Overall, Apriel-1.6 is a highly capable reasoner that maintains the memory and efficiency characteristics required for enterprise deployment.
Acknowledgements
We gratefully acknowledge the following people for their contributions: Varun Pandey, Shashank Maiya, Dhruv Jhamb, Massimo Caccia, Dheeraj Vattikonda, Nicolas Gontier, Patrice Bechard, Tayfun Tuna, Kavya Sriram, Denis Akhiyarov, Hari Subramani, Tara Bogavelli.
Notes and Limitations
We are a small lab with big goals. While we are not GPU-poor, our lab has, by comparison, a tiny fraction of the compute available to other frontier labs. Our goal with this work is to show that a SOTA model can be built with limited resources given the right data, design, and methodology.
We set out to build a small but powerful model, aiming for capabilities on par with frontier models. Developing a 15B model with this level of performance requires tradeoffs, so we prioritized SOTA-level performance and reasoning-token efficiency.
The model is trained to apply extensive reasoning to difficult questions and less reasoning effort to simpler ones. We are actively working to make our models more efficient and concise in future releases.
The model has a few vision-related limitations to be aware of. Complex or low-quality images can reduce OCR accuracy, dense scenes (like crowds or many similar objects) can make subtle details and counting more challenging, and highly detailed or unusually formatted charts may occasionally lead to imperfect interpretations. It may also be less precise with fine-grained visual grounding, so bounding-box predictions can sometimes be approximate or inconsistent.
References
[1] Radhakrishna, S., Tiwari, A., Shukla, A., Hashemi, M., Maheshwary, R., Malay, S.K.R., Mehta, J., Pattnaik, P., Mittal, S., Slimi, K., Ogueji, K., Oladipo, A., Parikh, S., Bamgbose, O., Liang, T., Masry, A., Mahajan, K., Mudumba, S.R., Yadav, V., Madhusudhan, S.T., Scholak, T., Davasam, S., Sunkara, S. and Chapados, N., 2025. Apriel-1.5-15b-Thinker. arXiv preprint arXiv:2510.01141.
[2] Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., Zhou, J. and Lin, J., 2025. Group Sequence Policy Optimization. arXiv preprint arXiv:2507.18071.