Susant-Achary committed
Commit 11ddab8 · verified · 1 Parent(s): c8b6ef2

Update README.md

Files changed (1):
  1. README.md +89 -4
README.md CHANGED
@@ -1,10 +1,95 @@
  ---
+ model-index:
+ - name: Granite-4.0-H-Tiny — MLX (Apple Silicon), **6-bit** (with guidance for 2/3/4/5-bit)
+   results: []
  license: apache-2.0
- library_name: mlx
+ language:
+ - en
  tags:
- - language
- - granite-4.0
+ - ibm
+ - granite
  - mlx
+ - apple-silicon
+ - mamba2
+ - transformer
+ - hybrid
+ - moe
+ - long-context
+ - instruct
+ - quantized
+ - 6bit
  pipeline_tag: text-generation
- base_model: ibm-granite/granite-4.0-h-tiny
+ library_name: mlx
+ ---
+
+ # Granite-4.0-H-Tiny — **MLX 6-bit** (Apple Silicon)
+ **Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)
+
+ This repository provides an **Apple-Silicon MLX build** of **IBM Granite-4.0-H-Tiny** quantized to **6-bit**.
+ Among MLX quant variants, **6-bit** offers the **highest fidelity** while still fitting comfortably on modern M-series Macs. If your workload involves **precise extraction, structured outputs, or long contexts**, 6-bit is usually the best on-device choice.
+
+ ---
+
+ ## 🔢 Choosing a quantization level (MLX variants)
+ Use this table as a **practical** guide for a ~7B hybrid MoE LM on Apple Silicon. (Figures vary by device/context; a rough RAM sanity check follows the table.)
+
+ | Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to Choose |
+ |---|---:|:---:|---|---|
+ | **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest, most lossy | Minimal RAM devices; smoke tests |
+ | **3-bit** | ~5–6 GB | **🔥🔥🔥🔥** | Direct, concise | Great default on M1/M2/M3/M4 |
+ | **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | If 3-bit misses details |
+ | **5-bit** | ~8–9 GB | 🔥🔥☆ | Higher fidelity | Heavier docs/structured outputs |
+ | **6-bit** *(this repo)* | **~9.5–11 GB** | 🔥🔥 | **Highest MLX fidelity** | Best quality on-device if RAM permits |
+
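+ A rough sanity check on these figures, assuming ~7B total parameters (weights only; the KV cache and runtime buffers account for the rest of the peak):
+
+ ```python
+ # Back-of-the-envelope RAM estimate for the quantized weights alone.
+ total_params = 7e9        # approximate total parameter count for the H-Tiny tier
+ bits_per_weight = 6       # this repository's quantization level
+
+ weight_gb = total_params * bits_per_weight / 8 / 1e9
+ print(f"quantized weights alone: ~{weight_gb:.1f} GB")  # ~5.2 GB
+ # Per-group quantization scales, the KV cache for long prompts, and runtime
+ # buffers push the practical peak toward the ~9.5–11 GB shown in the table.
+ ```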
+ **Tips**
+ - Prefer **6-bit** when you have ~10–12 GB free and want maximum quality.
+ - Use **3-bit/4-bit** for tighter RAM with good latency and strong baseline quality.
+ - For JSON/structured extraction, consider **temperature 0.0** and **schema-style prompts** (see the sketch below).
+
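+ A minimal sketch of that last tip with the `mlx-lm` Python API (the schema, example text, and `<this-repo-id>` placeholder are illustrative, not part of this repo):
+
+ ```python
+ from mlx_lm import load, generate
+
+ # Placeholder id; substitute this repository's actual Hub id.
+ model, tokenizer = load("<this-repo-id>")
+
+ # Schema-style prompt: state the output contract explicitly and ask for JSON only.
+ request = (
+     "Reply with JSON only, matching this schema: "
+     '{"vendor": string, "date": string, "amount": number}\n'
+     "Text: Invoice from Acme Corp dated 2024-03-01 for $1,250.00."
+ )
+
+ # Wrap the request in the model's chat template for instruct-style behavior.
+ messages = [{"role": "user", "content": request}]
+ prompt = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, tokenize=False
+ )
+
+ # mlx-lm decodes greedily unless a sampler is supplied (i.e., temperature 0.0),
+ # so repeated runs give reproducible structured output.
+ text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
+ print(text)
+ ```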
+ ---
+
+ ## 🔎 About Granite 4.0 (context for this build)
+ - **Architecture:** Hybrid **Mamba-2 + softmax attention**; *H* tiers add **Mixture-of-Experts (MoE)** for sparse activation and efficiency.
+ - **Model tier:** **H-Tiny** (~7B total params with ~1B active via MoE) — designed for **long-context** use and efficient serving.
+ - **License:** **Apache-2.0** (permissive, enterprise-friendly).
+ - **Use cases:** Instruction following, long-context assistants, RAG backends, structured outputs.
+
+ > This card documents the **MLX 6-bit** conversion. For lower-RAM devices, see the 2/3/4/5-bit guidance below.
+
+ ---
+
+ ## 📦 Contents of this repository
+ - `config.json` (MLX), `mlx_model*.safetensors` (**6-bit** shards)
+ - Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
+ - Any auxiliary metadata (e.g., `model_index.json`)
+
+ This build targets **macOS** on **Apple Silicon (M-series)** using **Metal/MPS**.
+
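+ Before pulling the full weights, the config can be fetched on its own to confirm what the build contains (a sketch assuming `huggingface_hub` is installed; `<this-repo-id>` is a placeholder, and keys are read defensively because converter versions differ):
+
+ ```python
+ import json
+ from huggingface_hub import hf_hub_download
+
+ # Download only config.json, not the multi-GB weight shards.
+ cfg_path = hf_hub_download("<this-repo-id>", "config.json")
+ with open(cfg_path) as f:
+     cfg = json.load(f)
+
+ # MLX conversions usually record their quantization settings in config.json;
+ # .get() avoids assuming exact key names.
+ print(cfg.get("model_type"), cfg.get("quantization"))
+ ```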
  ---
+
+ ## ✅ Intended use
+ - **High-fidelity** instruction following and summarization
+ - **Long-context** reasoning and retrieval-augmented generation (RAG)
+ - **Structured extraction** (JSON, key–value) and document parsing
+ - On-device prototyping where **answer faithfulness** matters
+
+ ## ⚠️ Limitations
+ - As with any quantization, small regressions vs FP16 can occur (complex math/code).
+ - **Token limits** and **KV-cache growth** still apply for very long contexts.
+ - Always add your own **guardrails/safety** for sensitive deployments.
+
+ ## 🚀 Quickstart (CLI — MLX)
+
+ **Deterministic generation**
+ ```bash
+ python -m mlx_lm.generate \
+   --model <this-repo-id> \
+   --prompt "Summarize the following meeting notes in 5 bullet points:\n<your text>" \
+   --max-tokens 256 \
+   --temp 0.0 \
+   --seed 0