---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- code-search
- knowledge-distillation
- modernbert
- apple-silicon
- mps
pipeline_tag: sentence-similarity
library_name: PyLate
license: apache-2.0
language:
- en
datasets:
- sentence-transformers/codesearchnet
base_model: lightonai/ColBERT-Zero
---
# ColBERT-Zero-6L-CodeSearch

A **6-layer ColBERT model** distilled from [ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) (22 layers) for code search, achieving **85% of the teacher's retrieval quality at 13x faster query speed**.

## Model Details

| Parameter | Value |
|-----------|-------|
| **Architecture** | ModernBERT (6 layers, 768 hidden, 12 heads) |
| **Base Model** | [lightonai/ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) |
| **Output Dimensionality** | 128 per-token embeddings |
| **Similarity Function** | MaxSim (late interaction) |
| **Parameters** | ~38M (vs ~100M teacher) |
| **Query Length** | 32 tokens |
| **Document Length** | 180 tokens |
| **License** | Apache 2.0 |
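
For reference, MaxSim (late interaction) scores a query against a document by taking, for each query token embedding, its highest similarity against all document token embeddings and summing those maxima. A minimal NumPy sketch of the idea (illustrative only, not PyLate's implementation):

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """MaxSim between one query and one document.

    query_emb: (num_query_tokens, 128), doc_emb: (num_doc_tokens, 128).
    Assumes per-token embeddings are already L2-normalized, as ColBERT models emit.
    """
    sim = query_emb @ doc_emb.T           # cosine similarities, shape (q_tokens, d_tokens)
    return float(sim.max(axis=1).sum())   # best document match per query token, summed
```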
## Benchmark Results

Evaluated on 3 code search corpora (150 questions total) via [litembeddings](https://github.com/alexandernicholson/litembeddings):

| Corpus | Teacher MRR | Student MRR | % of Teacher | Student Query Speed |
|--------|------------|-------------|--------------|---------------------|
| jq (C) | 0.539 | 0.355 | 65.9% | ~7ms |
| Rails (Ruby) | 0.679 | 0.581 | 85.6% | ~3ms |
| FastAPI (Python) | 0.782 | 0.766 | **98.0%** | ~4ms |
| **Aggregate** | **0.667** | **0.568** | **85.1%** | **~5ms** |

The student model is approximately **13x faster** at query time than the teacher while retaining 85% of retrieval quality. Performance is particularly strong on Python code search (98% of teacher).
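
MRR here is mean reciprocal rank: for each question, take the reciprocal of the rank at which the first relevant result appears, then average over all questions. A small illustrative helper (not the litembeddings evaluation code):

```python
def mean_reciprocal_rank(ranked_ids_per_query, relevant_id_per_query):
    """ranked_ids_per_query: one ranked list of result ids per query.
    relevant_id_per_query: the single relevant id expected for each query."""
    total = 0.0
    for ranked_ids, relevant_id in zip(ranked_ids_per_query, relevant_id_per_query):
        for position, doc_id in enumerate(ranked_ids, start=1):
            if doc_id == relevant_id:
                total += 1.0 / position
                break  # a query whose relevant doc is never retrieved contributes 0
    return total / len(ranked_ids_per_query)
```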
## How the Student Was Built

### Architecture: Layer Pruning from Teacher

The student was created by selecting 6 layers from ColBERT-Zero's 22-layer ModernBERT backbone using a **skewed-late** strategy that preserves more upper layers (which encode retrieval-relevant semantics):

```
Teacher layers: [0, 1, 2, ..., 21]      (22 total)
Student layers: [0, 8, 14, 17, 19, 21]  (6 selected)
```

The student inherits:

- All embedding weights from the teacher
- The 768-to-128 ColBERT projection layer
- Selected transformer layers with full weight copying
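
A rough sketch of the pruning step, assuming the teacher's ModernBERT backbone can be loaded as a plain `transformers` model that exposes its blocks as `.layers` (in the real pipeline the backbone sits inside the PyLate/sentence-transformers wrapper, so the exact attribute paths may differ):

```python
import copy

import torch
from transformers import AutoModel

KEEP = [0, 8, 14, 17, 19, 21]  # skewed-late selection from the 22-layer teacher

# Hypothetical direct load of the backbone; adjust to however the teacher checkpoint is stored
teacher = AutoModel.from_pretrained("lightonai/ColBERT-Zero")

student = copy.deepcopy(teacher)
# Keep only the selected transformer blocks, copying their weights verbatim;
# embeddings (and, in the full model, the 768->128 ColBERT projection) carry over unchanged
student.layers = torch.nn.ModuleList([copy.deepcopy(teacher.layers[i]) for i in KEEP])
student.config.num_hidden_layers = len(KEEP)
```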
### Training: Knowledge Distillation

- **Dataset**: [CodeSearchNet](https://huggingface.co/datasets/sentence-transformers/codesearchnet) (10,000 comment-code pairs)
- **Teacher scoring**: ColBERT-Zero generates MaxSim relevance scores for each query against 1 positive + 3 random negative documents
- **Loss**: PyLate Distillation loss (KL divergence between teacher and student score distributions)
- **Optimizer**: AdamW, lr=5e-5, weight_decay=0.01, warmup_ratio=0.1
- **Training**: 1000 steps, batch_size=8, gradient_accumulation=4 (effective batch size 32)
- **Hardware**: Apple Silicon (M4 Max) via PyTorch MPS backend, ~17 minutes total
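
A condensed sketch of what such a PyLate distillation run can look like. The dataset path, its column layout, and the output directory are placeholders, and the real pipeline also has to export the teacher's MaxSim scores for the 1 positive + 3 negatives per query before training:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from pylate import losses, models, utils

# The pruned 6-layer student from the previous step (hypothetical local path)
model = models.ColBERT(model_name_or_path="path/to/pruned-student")

# Rows are expected to pair each query with its candidate documents and teacher scores
train_dataset = load_dataset("json", data_files="kd_pairs.jsonl", split="train")

args = SentenceTransformerTrainingArguments(
    output_dir="colbert-zero-6l-codesearch",
    max_steps=1000,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size 32
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=losses.Distillation(model=model),  # distills teacher score distributions into the student
    data_collator=utils.ColBERTCollator(tokenize_fn=model.tokenize),
)
trainer.train()
```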
### Hyperparameter Search

The optimal configuration was found through **30 autonomous experiments** sweeping learning rate, layer selection strategy, batch size, gradient accumulation, weight decay, warmup ratio, number of negatives, training steps, and embedding dimensions. Key findings:

- **Teacher initialization is critical**: starting from ColBERT-Zero's weights (MRR 0.46) vs raw ModernBERT (MRR 0.08) — a 5.6x improvement
- **Skewed-late layer selection** outperforms evenly-spaced, last-6, and other strategies
- **Effective batch size 32** (bs=8, grad_accum=4) is optimal
- **Weight decay 0.01** provides a regularization benefit
## Usage

### Installation

```bash
pip install pylate
```

### Encoding & Retrieval

```python
from pylate import models
from pylate.scores import colbert_scores

# Load model
model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")

# Encode documents
doc_embeddings = model.encode(
    ["def hello():\n print('Hello, World!')", "class UserAuth:\n ..."],
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)

# Encode queries
query_embeddings = model.encode(
    ["function that prints a greeting"],
    batch_size=32,
    is_query=True,
    show_progress_bar=True,
)

# Score with MaxSim
scores = colbert_scores(query_embeddings, doc_embeddings)
print(scores)  # Higher = more relevant
```
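
The block above scores query/document pairs directly. For retrieval over a larger corpus, PyLate's standard index-and-retrieve workflow can be used with this model as well; a sketch (index folder, document ids, and `k` are placeholders):

```python
from pylate import indexes, models, retrieve

model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")

# Index the document embeddings once
index = indexes.Voyager(index_folder="pylate-index", index_name="codesearch", override=True)
documents_ids = ["doc1", "doc2"]
documents = ["def hello():\n print('Hello, World!')", "class UserAuth:\n ..."]
documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(documents_ids=documents_ids, documents_embeddings=documents_embeddings)

# Retrieve the top-k documents for each query
retriever = retrieve.ColBERT(index=index)
queries_embeddings = model.encode(["function that prints a greeting"], is_query=True)
results = retriever.retrieve(queries_embeddings=queries_embeddings, k=2)
print(results)  # per query: ranked list of {"id": ..., "score": ...}
```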
### Reranking

```python
from pylate import rank, models

model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")

queries = ["how to authenticate users"]
documents = [["def login(user, pwd): ...", "def sort_list(arr): ...", "class AuthMiddleware: ..."]]
documents_ids = [["doc1", "doc2", "doc3"]]

queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```
## GGUF / litembeddings

This model can be converted to GGUF format for use with [litembeddings](https://github.com/alexandernicholson/litembeddings) (SQLite-based embedding engine with SIMD-accelerated MaxSim):

```bash
# Convert to GGUF
python convert_hf_to_gguf.py ctrltokyo/ColBERT-Zero-6L-CodeSearch --outfile model-f16.gguf --outtype f16

# Extract projection
python -c "
from safetensors import safe_open
import numpy as np
f = safe_open('1_Dense/model.safetensors', framework='numpy')
f.get_tensor('linear.weight').astype(np.float32).tofile('model.projection')
"
```

Then in SQL:

```sql
SELECT lembed_model('codesearch', 'model-f16.gguf', '{"colbert_projection": "model.projection"}');
SELECT lembed_maxsim(
  lembed_tokens('search_query: how to sort a list'),
  lembed_tokens('search_document: def quicksort(arr): ...')
);
```
## Limitations

- **Weakest on C code search** (65.9% of teacher on jq corpus) — likely because CodeSearchNet training data is Python-heavy
- **Trained on 10k pairs only** — larger training sets or hard negative mining could improve quality further
- **English only** — inherits ColBERT-Zero's language capabilities
- **No asymmetric prompts** — unlike the teacher, this model does not use `search_query:`/`search_document:` prompts (uses `[Q]`/`[D]` prefixes instead)
## Citation

```bibtex
@misc{colbert-zero-6l-codesearch,
  title={ColBERT-Zero-6L-CodeSearch: A Distilled ColBERT Model for Code Search},
  author={Alexander Nicholson},
  year={2026},
  note={Distilled from ColBERT-Zero (Chaffin et al., 2026) using PyLate on Apple Silicon}
}
```
## Acknowledgments

- [ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) by LightOn AI — the teacher model
- [PyLate](https://github.com/lightonai/pylate) — ColBERT training framework
- [litembeddings](https://github.com/alexandernicholson/litembeddings) — SQLite embedding engine used for benchmarking
- Training and experimentation performed entirely on Apple Silicon (M4 Max) using PyTorch MPS backend