Perplexity Evaluation Pipeline Optimization for llama.cpp
+30% throughput improvement on multi-GPU Blackwell setup with 128K vocabulary models
Performance Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| Pipeline efficiency | Baseline | +30% | GPU idle gaps eliminated |
| Graph rebuilds (tail) | 1 per run | 0 | Tail padding keeps topology consistent |
| CPU-GPU overlap | None (serial) | Full overlap | Double-buffer pipeline |
Test environment: 3× NVIDIA Blackwell GPUs, CUDA 13.1, 128K vocab model, n_batch ≥ n_ctx
The Problem
llama.cpp's perplexity evaluation (tools/perplexity/perplexity.cpp) uses multi-sequence batching (PR #19661) to pack multiple context windows into a single batch. However, two sources of GPU idle time remained:
1. Serial CPU-GPU Pipeline
The original loop was strictly sequential:
```
Iteration N: [GPU compute] -> [sync] -> [CPU process logits] -> next iteration
                                        ^^^^^^^^^^^^^^^^^^^^
                                        GPU sits idle during this
```
process_logits() performs softmax + cross-entropy over n_seq × (n_ctx - 1) × n_vocab floats. With a 128K vocabulary this is non-trivial CPU work, and the GPU waits for it before starting the next batch.
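To put a rough number on that CPU work, here is a back-of-envelope sketch. The sizes are assumptions: n_seq = n_batch / n_ctx = 8192 / 2048 = 4 (matching the test command in this document) and n_vocab = 131072 for a "128K" vocabulary.

```cpp
#include <cstdint>

// Number of logit floats the CPU must softmax + cross-entropy per batch,
// following the n_seq × (n_ctx - 1) × n_vocab formula above.
uint64_t logit_floats(uint64_t n_seq, uint64_t n_ctx, uint64_t n_vocab) {
    return n_seq * (n_ctx - 1) * n_vocab;
}

// With n_seq = 4, n_ctx = 2048, n_vocab = 131072 this is about
// 1.07 billion floats, i.e. roughly 4 GiB of fp32 logits per batch.
```

At ~4 GiB of floating-point traffic per iteration, it is plausible that this phase rivals GPU decode time on a fast multi-GPU box.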
2. Tail Iteration Graph Rebuild
When the number of chunks is not evenly divisible by n_seq, the last iteration has fewer sequences. This changes the compute graph topology (n_tokens, n_seqs, n_outputs differ), breaking can_reuse and forcing a full build_graph() + sched_alloc_graph() rebuild, an expensive one-time penalty.
The Solution
Double-Buffer Pipeline
We introduce two ping-pong logit buffers and restructure the loop into three overlapping phases:
```
Iteration N:   [GPU compute N] ----------------> [sync + copy to buf[0]]
Iteration N+1: [GPU compute N+1] --------------> [sync + copy to buf[1]]
               [CPU process buf[0] (from N)] ----  (overlaps with GPU compute N+1)
Iteration N+2: [GPU compute N+2] --------------> [sync + copy to buf[0]]
               [CPU process buf[1] (from N+1)] --  (overlaps with GPU compute N+2)
```
Key insight: llama_decode() with graph reuse is effectively async: it dispatches GPU work and returns quickly. By deferring logit processing by one iteration, CPU and GPU run in parallel.
Implementation
```cpp
// Two ping-pong buffers, each holds n_seq × (n_ctx - first) × n_vocab floats
std::vector<float> dbuf_0(n_seq * logits_per_seq);
std::vector<float> dbuf_1(n_seq * logits_per_seq);
float * dbuf[2] = { dbuf_0.data(), dbuf_1.data() };
int cur_buf = 0;
int pend_i  = -1; // tracks which chunk is pending CPU processing

for (int i = 0; i < n_chunk; i += n_seq) {
    // PHASE 1: Clear KV cache + llama_decode (GPU dispatches async)
    llama_kv_self_clear(ctx);
    llama_decode(ctx, batch);

    // PHASE 2: Process PREVIOUS chunk's logits (overlaps with current GPU work)
    if (pend_i >= 0) {
        process_chunk_logits(dbuf[1 - cur_buf], ...);
    }

    // PHASE 3: Synchronize + copy current logits to buffer
    llama_synchronize(ctx);
    memcpy(dbuf[cur_buf], llama_get_logits(ctx), ...);
    cur_buf ^= 1; // swap buffer
    pend_i   = i;
}

// Process the final chunk
if (pend_i >= 0) {
    process_chunk_logits(dbuf[1 - cur_buf], ...);
}
```
Tail Padding
Always fill n_seq sequences in the batch, even when fewer real sequences remain. Padding sequences reuse sequence 0's token data:
```cpp
const int n_seq_fill = n_seq; // always full, not n_seq_batch
for (int seq = 0; seq < n_seq_fill; seq++) {
    const int src_seq = seq < n_seq_batch ? seq : 0; // pad with seq 0
    int seq_start = batch_start + src_seq * n_ctx;
    // ... fill batch tokens ...
}
// Only process logits for real sequences (n_seq_batch), ignore padding
```
This keeps the graph topology identical every iteration → can_reuse stays true → no graph rebuild.
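The padding rule itself is tiny. A standalone sketch of the source-sequence mapping (pad_sources is a hypothetical helper mirroring the loop above):

```cpp
#include <vector>

// For each of the n_seq batch slots, returns the real sequence whose
// tokens fill that slot: real sequences map to themselves, padding
// slots re-read sequence 0, so the batch shape never shrinks.
std::vector<int> pad_sources(int n_seq, int n_seq_batch) {
    std::vector<int> src(n_seq);
    for (int seq = 0; seq < n_seq; seq++) {
        src[seq] = seq < n_seq_batch ? seq : 0; // pad with seq 0
    }
    return src;
}
```

With n_seq = 4 and only 2 real sequences left, the slots read sequences 0, 1, 0, 0; the extra decode work on the duplicated tokens is wasted compute, but cheaper than a graph rebuild.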
Why 30% and Not 10-15%?
Our initial estimate of 10-15% assumed CPU logit processing was a small fraction of iteration time. On fast multi-GPU setups (3× Blackwell), GPU compute finishes sooner, so the CPU gap is proportionally larger. The double-buffer eliminates this gap entirely, yielding a bigger win than single-GPU expectations would suggest.
Additionally, the consistent graph topology from tail padding means can_reuse holds for every iteration (not just most), eliminating all graph rebuild overhead.
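As a sanity check on that argument: at steady state a serial iteration costs t_gpu + t_cpu, while a pipelined one costs roughly max(t_gpu, t_cpu). The example times below are illustrative, not measurements:

```cpp
#include <algorithm>
#include <cmath>

// Back-of-envelope overlap model: returns the throughput improvement
// as a fraction (0.30 == +30%) from hiding CPU work behind GPU work.
double overlap_gain(double t_gpu, double t_cpu) {
    return (t_gpu + t_cpu) / std::max(t_gpu, t_cpu) - 1.0;
}

// e.g. if CPU logit processing takes 30 time units per 100 units of GPU
// compute, fully overlapping them yields a +30% throughput improvement.
```

The same model shows why slower GPUs see less benefit: with t_cpu at only 10 units per 100 of GPU time, the ceiling drops to +10%.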
Applicability
| Configuration | Expected Improvement |
|---|---|
| Large vocab (128K+) + fast multi-GPU | 20-30% |
| Large vocab (128K+) + single GPU | 10-20% |
| Small vocab (32K) + single GPU | 5-10% |
| n_batch < n_ctx (no multi-seq) | ~0% (double-buffer still helps, but no tail padding benefit) |
Files Modified
tools/perplexity/perplexity.cpp
- perplexity() → double-buffer pipeline + tail padding
- kl_divergence() → double-buffer pipeline + tail padding + alignment assert
Build & Test
```sh
cmake --build build --config Release -t llama-perplexity -j
./build/bin/Release/llama-perplexity -m model.gguf -f input.txt -b 8192 -c 2048
```
No new dependencies. No API changes. Drop-in replacement.
Optimization by eddy, 2026-02-18
