---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
  - meta-llama/Llama-3.2-1B-Instruct
tags:
  - text-generation-inference
---

Llama-3.2-1B-Instruct-FlashHead-W4A16

An optimized version of Llama-3.2-1B-Instruct using quantization and FlashHead, Embedl's efficient replacement for the language-model head that reduces model size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:

  • FlashHead
  • Quantization (W4A16)
  • Custom vLLM generation via embedl-models

FlashHead matches the baseline Llama-3.2-1B within rounding on standard evaluations (MMLU-Pro, HellaSwag, GSM8K, etc.) and, in combination with quantization, achieves H200-level latency on RTX Ada GPUs.


Model Details

| Field | Value |
|---|---|
| Base Model | Llama-3.2-1B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head, Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |

Optimizations

  • FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
  • Quantization (W4A16) - large reduction in memory footprint with negligible impact on accuracy.
  • Custom Runtime Integration - compatible with vLLM (0.10.2) via the embedl-models package.

Performance

Token Generation Speed (RTX 3500 Ada, batch size = 1)

| Precision | Tokens/sec | Speedup vs BF16 |
|---|---|---|
| BF16 baseline | 130 | 1.0× |
| FlashHead (Embedl) | 163 | 1.25× |
| W4A16 baseline | 278 | 2.14× |
| FlashHead W4A16 (Embedl) | 485 | 3.73× |

Combined with W4A16 quantization, FlashHead improves end-to-end generation speed by roughly 1.75× over the W4A16 baseline (3.73× over BF16), while maintaining accuracy parity.
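
To get a rough read on these numbers on your own hardware, the sketch below times single-prompt generation through the FlashHead vLLM wrapper. The prompt, token budget, and timing loop are illustrative assumptions, not the exact harness used to produce the table.

import time
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"
llm = LLM(model=model_id, trust_remote_code=True)
sampling = SamplingParams(max_tokens=256, temperature=0.0)

prompt = "Summarize the benefits of small language models."
llm.generate([prompt], sampling)  # warm-up run to exclude one-time startup cost

start = time.perf_counter()
output = llm.generate([prompt], sampling)
elapsed = time.perf_counter() - start

generated = len(output[0].outputs[0].token_ids)  # tokens actually produced
print(f"{generated / elapsed:.1f} tokens/sec (batch size = 1)")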


Accuracy (Parity with Baseline)

| Method | MMLU-Pro | HellaSwag | IFEval | BoolQ | BBH | TruthfulQA | GSM8K |
|---|---|---|---|---|---|---|---|
| Baseline | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
| FlashHead | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |

FlashHead matches baseline performance exactly across all evaluation benchmarks.


Installation

pip install embedl-models

The embedl-models package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
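
A quick, illustrative sanity check (not part of the package) is to confirm that the FlashHead wrapper imports and that the pinned vLLM build is picked up:

import vllm
from embedl.models.vllm import LLM  # FlashHead-enabled entry point used in the examples below

print("vLLM version:", vllm.__version__)  # expected: 0.10.2 (pinned dependency)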


Usage Examples

vLLM Inference

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"

sampling = SamplingParams(max_tokens=128, temperature=0.0)
llm = LLM(model=model_id, trust_remote_code=True)

prompt = "Write a haiku about coffee."
output = llm.generate([prompt], sampling)
print(output[0].outputs[0].text)
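
Because this is an Instruct model, chat-formatted prompts generally work best. The following sketch assumes the repository ships a standard Llama 3.2 chat template with its tokenizer; it renders the messages into a single prompt string before calling generate().

from transformers import AutoTokenizer
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain why a smaller LM head speeds up decoding."},
]
# Render the chat template into a plain prompt string for generate().
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = llm.generate([prompt], SamplingParams(max_tokens=128, temperature=0.0))
print(output[0].outputs[0].text)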

Interactive REPL Example

The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.

import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16"
asyncio.run(
    run_repl(
        model=model_id
    )
)


⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face transformers pipeline.
Generation through transformers will fall back to the standard dense LM head, disabling FlashHead acceleration.

For now, we strongly recommend using the vLLM integration (embedl.models.vllm.LLM) to ensure FlashHead is active and optimized for low-latency inference.

Full support for the Hugging Face transformers pipeline with FlashHead integration will be released in the coming days.


Limitations

  • Limited to vLLM 0.10.2 (pinned dependency)
  • Batch size = 1 (real-time generation)
  • Currently optimized for NVIDIA RTX GPUs

Roadmap

Planned improvements:

  • Hugging Face transformers generation
  • Advanced mixed precision quantization
  • vLLM CLI benchmarking for detailed latency evaluation
  • lm-eval-harness integration for detailed accuracy evaluation
  • Upstream support in Transformers and vLLM
  • Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
  • Broader model coverage (larger models, VLMs, VLAs)

License

  • Upstream: Meta Llama 3.2 License
  • Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)

Contact

Enterprise & Commercial Inquiries: sales@embedl.com

Technical Issues & Early Access: https://github.com/embedl/embedl-models

More Information & Model Releases: https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

  • Embedl SDK - AI optimization tools & profiling
  • Embedl HUB - benchmarking platform
  • Engineering support for on-prem/edge deployments
  • Migration guidance (Llama / Qwen / Gemma)
  • Early access & partner co-marketing opportunities

Contact: sales@embedl.com