	---
	license: cc-by-nc-sa-4.0
	library_name: sentence-transformers
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- patent
	- embeddings
	- mteb
	language:
	- en
	pipeline_tag: sentence-similarity
	---

	# patembed-small

	This is a sentence-transformers model trained specifically for patent text embeddings. It is part of the PatenTEB project, which provides state-of-the-art models for patent document understanding and retrieval.

	Note: This model uses task-specific instruction prompts during inference for optimal performance.
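
As a rough illustration of what an instruction prompt does: when `prompt=` is passed to `encode`, Sentence Transformers prepends that string to each input text. The sketch below mimics that step in plain Python. The prompt text and the helper `apply_prompt` are hypothetical; the actual task-specific prompts are defined by the model configuration and the paper, not here.

```python
# Hypothetical instruction string, for illustration only; the real
# task-specific prompts ship with the model configuration.
EXAMPLE_PROMPT = "Retrieve patents similar to this query: "


def apply_prompt(prompt: str, texts: list[str]) -> list[str]:
    """Prepend an instruction prompt to every input text."""
    return [prompt + text for text in texts]


texts = ["Method for reducing power consumption in mobile devices"]
prompted = apply_prompt(EXAMPLE_PROMPT, texts)
print(prompted[0])

# With a loaded SentenceTransformer, the roughly equivalent call would be:
#   model.encode(texts, prompt=EXAMPLE_PROMPT)
# The prompts stored with a model can be inspected via model.prompts.
```

Checking `model.prompts` after loading is a reasonable way to find out which prompt names, if any, are registered for this checkpoint.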

	## Model Details

	- Model Type: Sentence Transformer
	- Base Architecture: Distilled from patembed-large using layers {0,4,8,12,16,20}
	- Parameters: 117M
	- Number of Layers: 6
	- Hidden Size: 1024
	- Embedding Dimension: 384
	- Max Sequence Length: 512 tokens
	- Language: English
	- License: CC BY-NC-SA 4.0
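
The layer subset in the Base Architecture entry follows a regular stride. A minimal sketch of that selection, assuming the teacher has 24 transformer layers (an assumption inferred from the index pattern, not stated in this card); `select_layers` is a hypothetical helper that works on any sequence of layers:

```python
TEACHER_NUM_LAYERS = 24  # assumption: not stated in the model card
STRIDE = 4


def select_layers(layers, indices):
    """Keep only the layers at the given indices, preserving order."""
    return [layers[i] for i in indices]


# Every fourth layer, starting at layer 0
kept_indices = list(range(0, TEACHER_NUM_LAYERS, STRIDE))

# Placeholder strings stand in for real transformer layers
teacher_layers = [f"layer_{i}" for i in range(TEACHER_NUM_LAYERS)]
student_layers = select_layers(teacher_layers, kept_indices)

print(kept_indices)         # [0, 4, 8, 12, 16, 20]
print(len(student_layers))  # 6
```

With a PyTorch backbone, the same indexing would typically be applied to the encoder's layer list.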

	## Model Description

	A variant for resource-limited deployment. It keeps the 1024-dimensional hidden size and projects the output to 384-dimensional embeddings.
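
To make the shapes concrete, the sketch below pools token vectors of size 1024 and maps the result to 384 dimensions. Mean pooling and a bias-free linear projection are assumptions made for illustration; this card does not specify the pooling or projection type, and the random values stand in for real encoder outputs and learned weights.

```python
import random

HIDDEN_SIZE = 1024    # from Model Details
EMBEDDING_DIM = 384   # from Model Details

random.seed(0)


def mean_pool(token_vectors):
    """Average token vectors (one list of floats per token) into one vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def project(vector, weight):
    """Apply a linear map; weight holds one row of len(vector) per output dim."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


# Random stand-ins for the encoder output (5 tokens) and projection weights
tokens = [[random.gauss(0, 1) for _ in range(HIDDEN_SIZE)] for _ in range(5)]
weight = [[random.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)]
          for _ in range(EMBEDDING_DIM)]

pooled = mean_pool(tokens)
embedding = project(pooled, weight)
print(len(pooled), len(embedding))  # 1024 384
```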

	This model is part of the patembed family, developed through multi-task learning on 13 training tasks from the PatenTEB benchmark. For detailed information about the training methodology, architecture, and comprehensive evaluation results, please refer to our paper.



	## Usage

	### Using Sentence Transformers

	```python
	from sentence_transformers import SentenceTransformer, util

	# Load the model
	model = SentenceTransformer('datalyes/patembed-small')

	# Encode patent texts
	patent_texts = [
	    "A method for manufacturing semiconductor devices...",
	    "An apparatus for processing chemical compounds...",
	]
	embeddings = model.encode(patent_texts)

	# Compute cosine similarity between the two embeddings
	similarity = util.cos_sim(embeddings[0], embeddings[1])
	print(f"Similarity: {similarity.item():.4f}")
	```

	### Using Transformers

	```python
	from transformers import AutoTokenizer, AutoModel
	import torch
	import torch.nn.functional as F

	# Load model and tokenizer
	tokenizer = AutoTokenizer.from_pretrained('datalyes/patembed-small')
	model = AutoModel.from_pretrained('datalyes/patembed-small')

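	def mean_pooling(model_output, attention_mask):
	    # First element of model_output holds the per-token embeddings
	    token_embeddings = model_output[0]
	    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
	    # Average over real tokens only; padding positions are masked out
	    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

	# Tokenize and encode (inputs are truncated to the 512-token maximum)
	texts = ["A method for manufacturing semiconductor devices..."]
	encoded = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')

	with torch.no_grad():
	    model_output = model(**encoded)

	# Note: pooling the backbone output gives hidden-size vectors; the 384-dim
	# projection is presumably applied only in the Sentence Transformers pipeline
	embeddings = mean_pooling(model_output, encoded['attention_mask'])
	embeddings = F.normalize(embeddings, p=2, dim=1)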
	```

	### Patent Retrieval Example

	```python
	from sentence_transformers import SentenceTransformer, util

	model = SentenceTransformer('datalyes/patembed-small')

	# Query patent
	query = "Method for reducing power consumption in mobile devices"

	# Candidate patents
	candidates = [
	    "A power management system for portable electronic devices...",
	    "Chemical composition for battery manufacturing...",
	    "Method for wireless data transmission in mobile networks...",
	]

	# Encode and retrieve
	query_emb = model.encode(query)
	candidate_embs = model.encode(candidates)

	# Compute similarities
	scores = util.cos_sim(query_emb, candidate_embs)[0]

	# Get ranked results
	results = [(candidates[i], scores[i].item()) for i in range(len(candidates))]
	results.sort(key=lambda x: x[1], reverse=True)

	for patent, score in results:
	    print(f"Score: {score:.4f} - {patent[:100]}...")
	```

	## Intended Use

	This model is designed for patent-specific tasks including:
	- Patent search and retrieval
	- Prior art search
	- Patent classification and clustering
	- Technology landscape analysis
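
As one illustration of the clustering use case, the sketch below groups precomputed embeddings by cosine similarity to a cluster seed. The toy 3-dimensional vectors stand in for real 384-dimensional model outputs, and `greedy_cluster` with its 0.8 threshold is a hypothetical helper, not part of the model or of the PatenTEB evaluation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_cluster(embeddings, threshold=0.8):
    """Assign each vector to the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of indices; the first is the seed
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(emb, embeddings[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing seed was close enough: start a new cluster
            clusters.append([i])
    return clusters


# Toy vectors standing in for model.encode(...) outputs
toy = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.9, 0.1]]
print(greedy_cluster(toy))  # [[0, 1], [2, 3]]
```

For larger collections, a dedicated clustering library would usually be a better fit than this single-pass approach, since the result here depends on input order.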

	For detailed training methodology, evaluation protocols, and performance analysis, please refer to our paper.

	## Citation

	If you use this model, please cite our paper:

	```bibtex
	@misc{ayaou2025patentebcomprehensivebenchmarkmodel,
	  title={PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding},
	  author={Iliass Ayaou and Denis Cavallucci},
	  year={2025},
	  eprint={2510.22264},
	  archivePrefix={arXiv},
	  primaryClass={cs.CL},
	  url={https://arxiv.org/abs/2510.22264}
	}
	```

	Paper: [PatenTEB on arXiv](https://arxiv.org/abs/2510.22264)

	## License

	This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

	Key Terms:
	- ✅ You can use, share, and adapt the model
	- ✅ You must give appropriate credit
	- ❌ You may not use the model for commercial purposes
	- ⚠️ If you adapt or build upon this model, you must distribute under the same license

	For full license details: https://creativecommons.org/licenses/by-nc-sa/4.0/

	## Contact

	- Authors: Iliass Ayaou, Denis Cavallucci
	- Institution: ICUBE Laboratory, INSA Strasbourg
	- GitHub: [iliass-y/patenteb](https://github.com/iliass-y/patenteb)
	- HuggingFace: [datalyes](https://huggingface.co/datalyes)