WilhelmT committed on
Commit 126857a · verified · 1 Parent(s): 844dc16

Update README.md

Files changed (1):
  1. README.md +1 -1
README.md CHANGED
@@ -18,7 +18,7 @@ Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:
 - Quantization (W4A16)
 - Custom vLLM generation via `embedl-models`
 
-FlashHead matches the baseline **Llama-3.2-1B** within rounding on standard evaluations (MMLU-Pro, HellaSwag, GSM8K, etc.) and, in combination with quantization, achieves **H200-level latency** on **RTX Ada** GPUs.
+FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.
 
 ---
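
The updated line references custom vLLM generation through `embedl-models`. As a rough sketch only, not the repo's documented path: the snippet below uses stock vLLM, the model ID is a placeholder, and `embedl-models` presumably wraps or replaces this generation loop.

```python
# Sketch using stock vLLM; assumes the W4A16 checkpoint's quantization
# config is auto-detected by vLLM (typical for quantized Hub checkpoints).
from vllm import LLM, SamplingParams

# Placeholder model ID -- substitute the actual Hub repo for this model card.
llm = LLM(model="embedl/FlashHead-Llama-3.2-1B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain W4A16 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```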