WilhelmT committed (verified)
Commit 579b0dd · 1 Parent(s): 5a0c6b9

Update README.md

Files changed (1):
1. README.md (+4 -2)
README.md CHANGED
````diff
@@ -81,6 +81,7 @@ The `embedl-models` package is required, it provides the optimized FlashHead imp
 ---
 
 ## Usage Examples
+**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
 
 ### vLLM Inference
 
@@ -92,7 +93,7 @@ model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"
 
 if __name__ == "__main__":
     sampling = SamplingParams(max_tokens=128, temperature=0.0)
-    llm = LLM(model=model_id, trust_remote_code=True)
+    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
 
     prompt = "Write a haiku about coffee."
     output = llm.generate([prompt], sampling)
@@ -115,7 +116,8 @@ model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"
 if __name__ == "__main__":
     asyncio.run(
         run_repl(
-            model=model_id
+            model=model_id,
+            max_model_len=131072
         )
     )
 ```
````
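For readers who hit the situation the new note describes, its advice maps onto the vLLM `LLM` constructor roughly as follows. This is a minimal sketch, not part of the README: the fallback values `max_model_len=32768` and `gpu_memory_utilization=0.95` are illustrative assumptions, not taken from the commit.

```python
from vllm import LLM, SamplingParams

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)

    # As in the updated README example: request the full 131072-token context.
    # On GPUs where the KV cache for this length does not fit in free VRAM,
    # vLLM raises an error at engine start-up.
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)

    # Fallback per the note: shrink the context window and/or let vLLM use a
    # larger share of GPU memory. The values below are illustrative only.
    # llm = LLM(
    #     model=model_id,
    #     trust_remote_code=True,
    #     max_model_len=32768,
    #     gpu_memory_utilization=0.95,
    # )

    output = llm.generate(["Write a haiku about coffee."], sampling)
    print(output[0].outputs[0].text)
```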