Update README.md

---
model-index:
- name: Granite-4.0-H-Tiny — MLX (Apple Silicon), 6-bit (with guidance for 2/3/4/5-bit)
  results: []
license: apache-2.0
language:
- en
tags:
- ibm
- granite
- mlx
- apple-silicon
- mamba2
- transformer
- hybrid
- moe
- long-context
- instruct
- quantized
- 6bit
pipeline_tag: text-generation
library_name: mlx
---

# Granite-4.0-H-Tiny — **MLX 6-bit** (Apple Silicon)

**Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)

This repository provides an **Apple-Silicon MLX build** of **IBM Granite-4.0-H-Tiny** quantized to **6-bit**.
Among MLX quant variants, **6-bit** offers the **highest fidelity** while still fitting comfortably on modern M-series Macs. If your workload involves **precise extraction, structured outputs, or long contexts**, 6-bit is usually the best on-device choice.

---

## 🔢 Choosing a quantization level (MLX variants)

Use this table as a **practical** guide for a ~7B hybrid MoE LM on Apple Silicon. (Figures vary by device/context.)

| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to Choose |
|---|---:|:---:|---|---|
| **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest, most lossy | Minimal-RAM devices; smoke tests |
| **3-bit** | ~5–6 GB | 🔥🔥🔥🔥 | Direct, concise | Great default on M1/M2/M3/M4 |
| **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | If 3-bit misses details |
| **5-bit** | ~8–9 GB | 🔥🔥☆ | Higher fidelity | Heavier docs/structured outputs |
| **6-bit** *(this repo)* | **~9.5–11 GB** | 🔥🔥 | **Highest MLX fidelity** | Best quality on-device if RAM permits |

**Tips**
- Prefer **6-bit** when you have ~10–12 GB free and want maximum quality.
- Use **3-bit/4-bit** for tighter RAM with good latency and strong baseline quality.
- For JSON/structured extraction, consider **temperature 0.0** and **schema-style prompts**; a sketch follows this list.
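
A minimal sketch of that last tip with the `mlx-lm` Python API. The schema and input text are illustrative, `<this-repo-id>` is a placeholder, and how you pin temperature to 0.0 varies across `mlx-lm` versions (older releases take a `temp` kwarg on `generate`; newer ones take a sampler object):

```python
from mlx_lm import load, generate

model, tokenizer = load("<this-repo-id>")  # placeholder repo id

# Schema-style prompt: name the fields and types, demand JSON only.
task = """Reply with JSON only, matching this schema:
{"invoice_number": "string", "total_amount": "number", "due_date": "YYYY-MM-DD"}

Text: Invoice INV-1042 for $1,980.50 is due on 2025-03-01."""

# Chat-format the request so the instruct model sees its expected template.
messages = [{"role": "user", "content": task}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Pin temperature to 0.0 per your installed mlx-lm version:
# greedy decoding keeps the JSON output stable across runs.
print(generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=False))
```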
---

## 🔎 About Granite 4.0 (context for this build)

- **Architecture:** Hybrid **Mamba-2 + softmax attention**; *H* tiers add **Mixture-of-Experts (MoE)** layers for sparse activation and efficiency (see the routing sketch after this section).
- **Model tier:** **H-Tiny** (~7B total parameters, with ~1B active per token via MoE), designed for **long-context** use and efficient serving.
- **License:** **Apache-2.0** (permissive, enterprise-friendly).
- **Use cases:** Instruction following, long-context assistants, RAG backends, structured outputs.

> This card documents the **MLX 6-bit** conversion. For lower-RAM devices, see the 2/3/4/5-bit guidance above.
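
To make the sparse-activation point concrete, here is a toy top-k routing sketch in Python/NumPy. It shows the general MoE technique only; the expert count, sizes, and router here are invented for illustration and are not Granite's actual configuration:

```python
import numpy as np

def moe_layer(x, experts, router_w, k=2):
    """Toy top-k MoE layer: each token runs through only k of n experts."""
    logits = x @ router_w                        # (tokens, n) routing scores
    topk = np.argsort(logits, axis=-1)[:, -k:]   # k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        gates = np.exp(sel - sel.max())          # softmax over selected experts
        gates /= gates.sum()
        for gate, e in zip(gates, topk[t]):
            out[t] += gate * (x[t] @ experts[e])  # weighted expert outputs
    return out

# "~1B active of ~7B total" is this idea at scale: all n experts hold
# parameters, but each token only pays compute for k of them.
rng = np.random.default_rng(0)
d, n = 64, 8
x = rng.standard_normal((4, d))
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n)]
out = moe_layer(x, experts, rng.standard_normal((d, n)), k=2)
print(out.shape)  # (4, 64)
```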
---

## 📦 Contents of this repository

- `config.json` (MLX), `mlx_model*.safetensors` (**6-bit** shards)
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
- Auxiliary metadata (e.g., `model_index.json`)

This build targets **macOS** on **Apple Silicon (M-series)** using **Metal/MPS**.
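
These files are all `mlx-lm` needs. A sketch of fetching and loading them from a local copy (`<this-repo-id>` is a placeholder):

```python
from huggingface_hub import snapshot_download
from mlx_lm import load

# Download the shards and tokenizer once, then load from disk.
local_dir = snapshot_download(repo_id="<this-repo-id>")
model, tokenizer = load(local_dir)
print(type(model).__name__)  # sanity check that the MLX model class loaded
```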
---

## ✅ Intended use

- **High-fidelity** instruction following and summarization
- **Long-context** reasoning and retrieval-augmented generation (RAG)
- **Structured extraction** (JSON, key–value) and document parsing
- On-device prototyping where **answer faithfulness** matters

## ⚠️ Limitations

- As with any quantization, small regressions vs FP16 can occur (most visibly on complex math and code).
- **Token limits** and **KV-cache growth** still apply for very long contexts; a sizing sketch follows this list.
- Always add your own **guardrails/safety** for sensitive deployments.
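
For intuition about KV-cache growth, here is a back-of-envelope sizing sketch. Every architecture number below is a placeholder, not Granite-4.0-H-Tiny's real configuration; note that in a hybrid design the Mamba-2 blocks keep constant-size state, so only the attention layers pay this per-token cost:

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Rough attention KV-cache size: one K and one V entry per layer, per position."""
    return 2 * attn_layers * kv_heads * head_dim * context_len * bytes_per_elem

# Placeholder config: 8 attention layers, 8 KV heads of dim 128,
# fp16 cache entries, 128k-token context.
gib = kv_cache_bytes(attn_layers=8, kv_heads=8, head_dim=128, context_len=128_000) / 2**30
print(f"~{gib:.1f} GiB of KV cache")  # ~3.9 GiB, growing linearly with context
```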

## 🚀 Quickstart (CLI — MLX)

**Deterministic generation**

```bash
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --prompt "Summarize the following meeting notes in 5 bullet points:\n<your text>" \
  --max-tokens 256 \
  --temp 0.0 \
  --seed 0
```
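
The same call via the `mlx-lm` Python API (a sketch; `<this-repo-id>` is a placeholder and generation kwargs differ slightly across `mlx-lm` versions):

```python
from mlx_lm import load, generate

# Downloads (or reuses) the 6-bit weights and tokenizer from the Hub.
model, tokenizer = load("<this-repo-id>")

messages = [{"role": "user",
             "content": "Summarize the following meeting notes in 5 bullet points:\n<your text>"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True streams tokens to stdout as they are generated.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```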