---
model-index:
- name: >-
Granite-4.0-H-Tiny — MLX (Apple Silicon), **6-bit** (with guidance for
2/3/4/5-bit)
results: []
license: apache-2.0
language:
- en
tags:
- ibm
- granite
- mlx
- apple-silicon
- mamba2
- transformer
- hybrid
- moe
- long-context
- instruct
- quantized
- 6bit
pipeline_tag: text-generation
library_name: mlx
base_model:
- ibm-granite/granite-4.0-h-tiny
---
# Granite-4.0-H-Tiny — **MLX 6-bit** (Apple Silicon)
**Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)
This repository provides an **Apple-Silicon MLX build** of **IBM Granite-4.0-H-Tiny** quantized to **6-bit**.
Among MLX quant variants, **6-bit** offers the **highest fidelity** while still fitting comfortably on modern M-series Macs. If your workload involves **precise extraction, structured outputs, or long contexts**, 6-bit is usually the best on-device choice.
---
## 🔢 Choosing a quantization level (MLX variants)
Use this table as a **practical** guide for a ~7B hybrid MoE LM on Apple Silicon. (Figures vary by device/context.)
| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to Choose |
|---|---:|:---:|---|---|
| **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest, most lossy | Minimal RAM devices; smoke tests |
| **3-bit** | ~5–6 GB | **🔥🔥🔥🔥** | Direct, concise | Great default on M1/M2/M3/M4 |
| **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | If 3-bit misses details |
| **5-bit** | ~8–9 GB | 🔥🔥☆ | Higher fidelity | Heavier docs/structured outputs |
| **6-bit** *(this repo)* | **~9.5–11 GB** | 🔥🔥 | **Highest MLX fidelity** | Best quality on-device if RAM permits |
**Tips**
- Prefer **6-bit** when you have ~10–12 GB free and want maximum quality.
- Use **3-bit/4-bit** for tighter RAM with good latency and strong baseline quality.
- For JSON/structured extraction, consider **temperature 0.0** and **schema-style prompts**.
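As a concrete illustration of the last tip, here is a minimal sketch of schema-style JSON extraction with the `mlx_lm` Python API. It assumes a recent `mlx-lm` release (where greedy, temperature-0.0 decoding is the default for `generate`) and uses `<this-repo-id>` as a placeholder for this repository's id; the invoice text is a toy example.
```python
# Minimal sketch: schema-style JSON extraction with mlx_lm.
# Assumes `pip install mlx-lm`; <this-repo-id> is a placeholder for this repo's id.
from mlx_lm import load, generate

model, tokenizer = load("<this-repo-id>")

# Schema-style prompt: spell out the exact JSON keys you expect back.
prompt = (
    "Extract the fields below from the text and reply with JSON only.\n"
    'Schema: {"invoice_number": string, "total": number, "due_date": string}\n\n'
    "Text: Invoice INV-0042 for $1,250.00 is due on 2025-03-31."  # toy example input
)

# Greedy decoding is the default in the assumed mlx-lm version, matching "temperature 0.0".
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)
```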
---
## 🔎 About Granite 4.0 (context for this build)
- **Architecture:** Hybrid **Mamba-2 + softmax attention**; *H* tiers add **Mixture-of-Experts (MoE)** for sparse activation and efficiency.
- **Model tier:** **H-Tiny** (~7B total params with ~1B active via MoE) — designed for **long-context** use and efficient serving.
- **License:** **Apache-2.0** (permissive, enterprise-friendly).
- **Use cases:** Instruction following, long-context assistants, RAG backends, structured outputs.
> This card documents the **MLX 6-bit** conversion. For lower-RAM devices, see the 2/3/4/5-bit guidance below.
---
## 📦 Contents of this repository
- `config.json` (MLX), `mlx_model*.safetensors` (**6-bit** shards)
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
- Any auxiliary metadata (e.g., `model_index.json`)
This build targets **macOS** on **Apple Silicon (M-series)** using **Metal/MPS**.
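If you prefer to fetch the 6-bit shards once and load them from disk, a sketch along these lines should work. It assumes `huggingface_hub` and `mlx-lm` are installed and again uses `<this-repo-id>` as a placeholder.
```python
# Sketch: download the 6-bit shards once, then load them locally with MLX.
# Assumes `pip install mlx-lm huggingface_hub`; <this-repo-id> is a placeholder.
from huggingface_hub import snapshot_download
from mlx_lm import load

local_dir = snapshot_download(repo_id="<this-repo-id>")

# load() accepts a local path as well as a Hub repo id; on Apple Silicon the
# weights run on the Metal backend automatically.
model, tokenizer = load(local_dir)
print("Loaded 6-bit Granite-4.0-H-Tiny from", local_dir)
```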
---
## ✅ Intended use
- **High-fidelity** instruction following and summarization
- **Long-context** reasoning and retrieval-augmented generation (RAG)
- **Structured extraction** (JSON, key–value) and document parsing
- On-device prototyping where **answer faithfulness** matters
## ⚠️ Limitations
- As with any quantization, small regressions versus FP16 can occur, most often on complex math or code.
- **Token limits** and **KV-cache growth** still apply for very long contexts.
- Always add your own **guardrails/safety** for sensitive deployments.
## 🚀 Quickstart (CLI — MLX)
**Deterministic generation**
```bash
# MLX runs on the Metal GPU automatically on Apple Silicon; no device flag is needed.
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --prompt "Summarize the following meeting notes in 5 bullet points:\n<your text>" \
  --max-tokens 256 \
  --temp 0.0 \
  --seed 0
```
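**Python API (equivalent sketch)**
The snippet below is a hedged Python-API counterpart of the CLI call above, assuming a recent `mlx-lm` release and `<this-repo-id>` as a placeholder. It applies the model's chat template when the tokenizer ships one, which the CLI also does by default.
```python
# Hedged Python-API equivalent of the CLI quickstart above.
# Assumes `pip install mlx-lm`; <this-repo-id> is a placeholder for this repo's id.
from mlx_lm import load, generate

model, tokenizer = load("<this-repo-id>")

messages = [
    {
        "role": "user",
        "content": "Summarize the following meeting notes in 5 bullet points:\n<your text>",
    }
]

# Apply the model's chat template if one is available; otherwise fall back to the raw prompt.
if getattr(tokenizer, "chat_template", None):
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False
    )
else:
    prompt = messages[0]["content"]

# Greedy decoding (the default in the assumed mlx-lm version) matches --temp 0.0 above.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```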