Update README.md
README.md CHANGED
@@ -18,7 +18,7 @@ Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:
 - Quantization (W4A16)
 - Custom vLLM generation via `embedl-models`

-FlashHead matches the baseline
+FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.

 ---
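For context on the bullets above (W4A16 quantization and vLLM-based generation), here is a minimal sketch of what loading a quantized checkpoint with stock vLLM could look like. The model path is a placeholder, and the repo's actual generation path goes through `embedl-models` rather than plain vLLM, so treat this only as an illustration of the serving stack being referenced.

```python
# Hypothetical sketch: serving a W4A16-quantized checkpoint with stock vLLM.
# The checkpoint path is a placeholder; the repo's own generation flow uses `embedl-models`.
from vllm import LLM, SamplingParams

# vLLM picks up the quantization scheme from the checkpoint's config.
llm = LLM(model="path/to/flashhead-llama-3.2-1b-instruct-w4a16")  # placeholder path
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(["Summarize what FlashHead changes in one sentence."], params)
print(outputs[0].outputs[0].text)
```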