embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16

text-generation-inference

compressed-tensors


WilhelmT committed 5 days ago

Commit 844dc16 · verified · 1 Parent(s): 579b0dd

Update README.md

Files changed (1)

README.md +4 -0

README.md CHANGED

@@ -57,6 +57,10 @@ FlashHead matches the baseline **Llama-3.2-1B** within rounding on standard eval
 FlashHead improves end-to-end speed by **1.75×** over state-of-the-art, while maintaining full accuracy parity.
+**Measurement setup:** vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.
+**NVIDIA H200 measurement:** **FP8**, **512 Tokens/sec**.
 ---
 ## Accuracy (Parity with Baseline)
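
The measurement setup added in this commit (10 warm-up runs, then an average over 100 runs at batch_size=1 with max_new_tokens=128) can be sketched as a small timing harness. This is only an illustration under stated assumptions, not the authors' benchmark script: `measure_tokens_per_sec` and `stub_generate` are hypothetical names, the `generate` callable stands in for one vLLM generation call, and the sketch assumes the reported figure is the mean of per-run rates rather than total tokens divided by total time.

```python
import time
from statistics import mean
from typing import Callable


def measure_tokens_per_sec(
    generate: Callable[[], int],
    warmup_runs: int = 10,
    timed_runs: int = 100,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the mean tokens/sec over `timed_runs` calls to `generate`.

    `generate` is a hypothetical zero-argument callable that performs one
    generation (e.g. one prompt of length 32 with max_new_tokens=128) and
    returns the number of tokens it produced.
    """
    # Warm-up runs are executed but not timed, so one-time costs
    # (kernel compilation, cache allocation) do not skew the average.
    for _ in range(warmup_runs):
        generate()

    rates = []
    for _ in range(timed_runs):
        start = clock()
        n_tokens = generate()
        elapsed = clock() - start
        rates.append(n_tokens / elapsed)

    # Assumption: "averaged over 100 runs" means the mean of per-run rates.
    return mean(rates)


if __name__ == "__main__":
    def stub_generate() -> int:
        # Placeholder for a real model call; sleeps briefly and
        # pretends that 128 new tokens were generated.
        time.sleep(0.001)
        return 128

    print(f"{measure_tokens_per_sec(stub_generate):.1f} tokens/sec (stub)")
```

To time a real model, `stub_generate` would be replaced by a function that runs a single generation and returns the count of generated tokens. The number printed by the stub says nothing about FlashHead or the H200 figure quoted above.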