Perplexity Evaluation Pipeline Optimization for llama.cpp
+30% throughput improvement on multi-GPU Blackwell setup with 128K vocabulary models
Performance Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| Pipeline efficiency | Baseline | +30% | GPU idle gaps eliminated |
| Graph rebuilds (tail) | 1 per run | 0 | Tail padding keeps topology consistent |
| CPU-GPU overlap | None (serial) | Full overlap | Double-buffer pipeline |
Test environment: 3× NVIDIA Blackwell GPUs, CUDA 13.1, 128K vocab model, n_batch ≥ n_ctx
The Problem
llama.cpp's perplexity evaluation (tools/perplexity/perplexity.cpp) uses multi-sequence batching (PR #19661) to pack multiple context windows into a single batch. However, two sources of GPU idle time remained:
1. Serial CPU-GPU Pipeline
The original loop was strictly sequential:
```
Iteration N: [GPU compute] -> [sync] -> [CPU process logits] -> next iteration
                                        ^^^^^^^^^^^^^^^^^^^^
                                        GPU sits idle during this
```
process_logits() performs softmax + cross-entropy over n_seq × (n_ctx - 1) × n_vocab floats. With a 128K vocabulary this is non-trivial CPU work, and the GPU waits for it before starting the next batch.
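To put a rough number on that CPU work, here is a back-of-envelope sketch. The sizes are assumptions: n_seq = n_batch / n_ctx = 8192 / 2048 = 4 (matching the test command in this document) and n_vocab = 131072 for a "128K" vocabulary.

```cpp
#include <cstdint>

// Number of logit floats the CPU must softmax + cross-entropy per batch,
// following the n_seq × (n_ctx - 1) × n_vocab formula above.
uint64_t logit_floats(uint64_t n_seq, uint64_t n_ctx, uint64_t n_vocab) {
    return n_seq * (n_ctx - 1) * n_vocab;
}

// With n_seq = 4, n_ctx = 2048, n_vocab = 131072 this is about
// 1.07 billion floats, i.e. roughly 4 GiB of fp32 logits per batch.
```

At ~4 GiB of floating-point traffic per iteration, it is plausible that this phase rivals GPU decode time on a fast multi-GPU box.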
2. Tail Iteration Graph Rebuild
When the number of chunks is not evenly divisible by n_seq, the last iteration has fewer sequences. This changes the compute graph topology (n_tokens, n_seqs, n_outputs differ), breaking can_reuse and forcing a full build_graph() + sched_alloc_graph() rebuild, an expensive one-time penalty.
The Solution
Double-Buffer Pipeline
We introduce two ping-pong logit buffers and restructure the loop into three overlapping phases:
```
Iteration N:   [GPU compute N] ----------------> [sync + copy to buf[0]]
Iteration N+1: [GPU compute N+1] --------------> [sync + copy to buf[1]]
               [CPU process buf[0] (from N)] ----  (overlaps with GPU compute N+1)
Iteration N+2: [GPU compute N+2] --------------> [sync + copy to buf[0]]
               [CPU process buf[1] (from N+1)] --  (overlaps with GPU compute N+2)
```
Key insight: llama_decode() with graph reuse is effectively async: it dispatches GPU work and returns quickly. By deferring logit processing by one iteration, CPU and GPU run in parallel.
Implementation
```cpp
// Two ping-pong buffers, each holds n_seq × (n_ctx - first) × n_vocab floats
std::vector<float> dbuf_0(n_seq * logits_per_seq);
std::vector<float> dbuf_1(n_seq * logits_per_seq);
float * dbuf[2] = { dbuf_0.data(), dbuf_1.data() };
int cur_buf = 0;
int pend_i  = -1; // tracks which chunk is pending CPU processing

for (int i = 0; i < n_chunk; i += n_seq) {
    // PHASE 1: Clear KV cache + llama_decode (GPU dispatches async)
    llama_kv_self_clear(ctx);
    llama_decode(ctx, batch);

    // PHASE 2: Process PREVIOUS chunk's logits (overlaps with current GPU work)
    if (pend_i >= 0) {
        process_chunk_logits(dbuf[1 - cur_buf], ...);
    }

    // PHASE 3: Synchronize + copy current logits to buffer
    llama_synchronize(ctx);
    memcpy(dbuf[cur_buf], llama_get_logits(ctx), ...);
    cur_buf ^= 1; // swap buffer
    pend_i   = i;
}

// Process the final chunk
if (pend_i >= 0) {
    process_chunk_logits(dbuf[1 - cur_buf], ...);
}
```
Tail Padding
Always fill n_seq sequences in the batch, even when fewer real sequences remain. Padding sequences reuse sequence 0's token data:
```cpp
const int n_seq_fill = n_seq; // always full, not n_seq_batch
for (int seq = 0; seq < n_seq_fill; seq++) {
    const int src_seq = seq < n_seq_batch ? seq : 0; // pad with seq 0
    int seq_start = batch_start + src_seq * n_ctx;
    // ... fill batch tokens ...
}
// Only process logits for real sequences (n_seq_batch), ignore padding
```
This keeps the graph topology identical every iteration → can_reuse stays true → no graph rebuild.
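The padding rule itself is tiny. A standalone sketch of the source-sequence mapping (pad_sources is a hypothetical helper mirroring the loop above):

```cpp
#include <vector>

// For each of the n_seq batch slots, returns the real sequence whose
// tokens fill that slot: real sequences map to themselves, padding
// slots re-read sequence 0, so the batch shape never shrinks.
std::vector<int> pad_sources(int n_seq, int n_seq_batch) {
    std::vector<int> src(n_seq);
    for (int seq = 0; seq < n_seq; seq++) {
        src[seq] = seq < n_seq_batch ? seq : 0; // pad with seq 0
    }
    return src;
}
```

With n_seq = 4 and only 2 real sequences left, the slots read sequences 0, 1, 0, 0; the extra decode work on the duplicated tokens is wasted compute, but cheaper than a graph rebuild.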
Why 30% and Not 10-15%?
Our initial estimate of 10-15% assumed CPU logit processing was a small fraction of iteration time. On fast multi-GPU setups (3× Blackwell), GPU compute finishes sooner, so the CPU gap is proportionally larger. The double-buffer eliminates this gap entirely, yielding a bigger win than single-GPU expectations would suggest.
Additionally, the consistent graph topology from tail padding means can_reuse holds for every iteration (not just most), eliminating all graph rebuild overhead.
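As a sanity check on that argument: at steady state a serial iteration costs t_gpu + t_cpu, while a pipelined one costs roughly max(t_gpu, t_cpu). The example times below are illustrative, not measurements:

```cpp
#include <algorithm>
#include <cmath>

// Back-of-envelope overlap model: returns the throughput improvement
// as a fraction (0.30 == +30%) from hiding CPU work behind GPU work.
double overlap_gain(double t_gpu, double t_cpu) {
    return (t_gpu + t_cpu) / std::max(t_gpu, t_cpu) - 1.0;
}

// e.g. if CPU logit processing takes 30 time units per 100 units of GPU
// compute, fully overlapping them yields a +30% throughput improvement.
```

The same model shows why slower GPUs see less benefit: with t_cpu at only 10 units per 100 of GPU time, the ceiling drops to +10%.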
Applicability
| Configuration | Expected Improvement |
|---|---|
| Large vocab (128K+) + fast multi-GPU | 20-30% |
| Large vocab (128K+) + single GPU | 10-20% |
| Small vocab (32K) + single GPU | 5-10% |
| n_batch < n_ctx (no multi-seq) | ~0% (double-buffer still helps, but no tail padding benefit) |
Files Modified
tools/perplexity/perplexity.cpp
- perplexity() → double-buffer pipeline + tail padding
- kl_divergence() → double-buffer pipeline + tail padding + alignment assert
Build & Test
```sh
cmake --build build --config Release -t llama-perplexity -j
./build/bin/Release/llama-perplexity -m model.gguf -f input.txt -b 8192 -c 2048
```
No new dependencies. No API changes. Drop-in replacement.
Optimization by eddy, 2026-02-18
