
Perplexity Evaluation Pipeline Optimization for llama.cpp

+30% throughput improvement on a multi-GPU Blackwell setup with 128K-vocabulary models

Performance Results

| Metric | Before | After | Improvement |
|---|---|---|---|
| Pipeline efficiency | Baseline | +30% | GPU idle gaps eliminated |
| Graph rebuilds (tail) | 1 per run | 0 | Tail padding keeps topology consistent |
| CPU-GPU overlap | None (serial) | Full overlap | Double-buffer pipeline |

Test environment: 3× NVIDIA Blackwell GPUs, CUDA 13.1, 128K vocab model, n_batch ≥ n_ctx

The Problem

llama.cpp's perplexity evaluation (tools/perplexity/perplexity.cpp) uses multi-sequence batching (PR #19661) to pack multiple context windows into a single batch. However, two sources of GPU idle time remained:

1. Serial CPU-GPU Pipeline

The original loop was strictly sequential:

Iteration N:   [GPU compute] → [sync] → [CPU process logits] → next iteration
                                        ^^^^^^^^^^^^^^^^^^^^
                                        GPU sits idle during this

process_logits() performs a softmax + cross-entropy over n_seq × (n_ctx - 1) × n_vocab floats. With a 128K vocabulary this is non-trivial CPU work, and the GPU waits for it before starting the next batch.

2. Tail Iteration Graph Rebuild

When the number of chunks is not evenly divisible by n_seq, the last iteration has fewer sequences (e.g. with n_chunk = 7 and n_seq = 2, the final iteration carries only one). This changes the compute graph topology (n_tokens, n_seqs, n_outputs differ), breaking can_reuse and forcing a full build_graph() + sched_alloc_graph() rebuild: an expensive one-time penalty.

The Solution

Double-Buffer Pipeline

We introduce two ping-pong logit buffers and restructure the loop into three overlapping phases:

Iteration N:    [GPU compute N]  ──────────────────→  [sync + copy to buf[0]]
Iteration N+1:  [GPU compute N+1] ──────────────────→  [sync + copy to buf[1]]
                [CPU process buf[0] (from N)] ──→
Iteration N+2:  [GPU compute N+2] ──────────────────→  [sync + copy to buf[0]]
                [CPU process buf[1] (from N+1)] ──→

Key insight: llama_decode() with graph reuse is effectively asynchronous: it dispatches GPU work and returns quickly. By deferring logit processing by one iteration, the CPU and GPU run in parallel.

Implementation

// Two ping-pong buffers, each holds n_seq × (n_ctx - first) × n_vocab floats
std::vector<float> dbuf_0(n_seq * logits_per_seq);
std::vector<float> dbuf_1(n_seq * logits_per_seq);
float * dbuf[2] = { dbuf_0.data(), dbuf_1.data() };
int cur_buf = 0;
int pend_i = -1;  // tracks which chunk is pending CPU processing

for (int i = 0; i < n_chunk; i += n_seq) {
    // PHASE 1: Clear KV cache + llama_decode (GPU dispatches async)
    llama_kv_self_clear(ctx);
    llama_decode(ctx, batch);

    // PHASE 2: Process PREVIOUS chunk's logits (overlaps with current GPU work)
    if (pend_i >= 0) {
        process_chunk_logits(dbuf[1 - cur_buf], ...);
    }

    // PHASE 3: Synchronize + copy current logits to buffer
    llama_synchronize(ctx);  // block until the decode dispatched in PHASE 1 finishes
    memcpy(dbuf[cur_buf], llama_get_logits(ctx), ...);
    cur_buf ^= 1;  // swap buffer
    pend_i = i;
}

// Process the final chunk
if (pend_i >= 0) {
    process_chunk_logits(dbuf[1 - cur_buf], ...);
}

Tail Padding

Always fill n_seq sequences in the batch, even when fewer real sequences remain. Padding sequences reuse sequence 0's token data:

const int n_seq_fill = n_seq;  // always full, not n_seq_batch

for (int seq = 0; seq < n_seq_fill; seq++) {
    const int src_seq = seq < n_seq_batch ? seq : 0;  // pad with seq 0
    int seq_start = batch_start + src_seq * n_ctx;
    // ... fill batch tokens ...
}
// Only process logits for real sequences (n_seq_batch), ignore padding

This keeps the graph topology identical every iteration → can_reuse stays true → no graph rebuild.

Why 30% and Not 10-15%?

Our initial estimate of 10-15% assumed CPU logit processing was a small fraction of iteration time. On fast multi-GPU setups (3× Blackwell), GPU compute finishes sooner relative to the CPU work, so the CPU gap is proportionally larger. The double-buffer eliminates this gap entirely, yielding a larger improvement than single-GPU expectations would suggest.

Additionally, the consistent graph topology from tail padding means can_reuse holds for every iteration (not just most), eliminating all graph rebuild overhead.

Applicability

| Configuration | Expected improvement |
|---|---|
| Large vocab (128K+) + fast multi-GPU | 20-30% |
| Large vocab (128K+) + single GPU | 10-20% |
| Small vocab (32K) + single GPU | 5-10% |
| n_batch < n_ctx (no multi-seq) | Small (double-buffer still helps; no tail-padding benefit) |

Files Modified

  • tools/perplexity/perplexity.cpp
    • perplexity(): double-buffer pipeline + tail padding
    • kl_divergence(): double-buffer pipeline + tail padding + alignment assert

Build & Test

cmake --build build --config Release -t llama-perplexity -j
./build/bin/Release/llama-perplexity -m model.gguf -f input.txt -b 8192 -c 2048

No new dependencies. No API changes. Drop-in replacement.


Optimization by eddy, 2026-02-18
