---
license: cc-by-nc-sa-4.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- patent
- embeddings
- mteb
language:
- en
pipeline_tag: sentence-similarity
---

# patembed-small

This is a **sentence-transformers** model trained specifically for **patent text embeddings**. It is part of the **PatenTEB** project, which provides state-of-the-art models for patent document understanding and retrieval.

**Note:** This model uses task-specific instruction prompts during inference for optimal performance.

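As a rough illustration of what an instruction prompt does at inference time, the sketch below prepends a task-specific prefix to each input before it is encoded. The prompt names and wording (`HYPOTHETICAL_PROMPTS`, `retrieval_query`, `retrieval_document`) and the helper `apply_prompt` are placeholders invented for this example; the prompts actually shipped with patembed-small are defined in the model repository and the paper, and may differ.

```python
# Hypothetical prompt table: names and wording are placeholders, NOT the
# prompts shipped with patembed-small.
HYPOTHETICAL_PROMPTS = {
    "retrieval_query": "Retrieve patents relevant to this query: ",
    "retrieval_document": "Patent document: ",
}


def apply_prompt(texts, prompt_name, prompts=HYPOTHETICAL_PROMPTS):
    """Prepend the prompt registered under `prompt_name` to every text."""
    if prompt_name not in prompts:
        raise KeyError(f"Unknown prompt name: {prompt_name!r}")
    prefix = prompts[prompt_name]
    return [prefix + text for text in texts]


queries = apply_prompt(
    ["Method for reducing power consumption in mobile devices"],
    "retrieval_query",
)
print(queries[0])
```

With Sentence Transformers, the same effect is normally obtained by passing `prompt_name=...` or `prompt=...` to `model.encode`, using the names stored in the model's configuration (inspectable via `model.prompts`).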

## Model Details

- **Model Type**: Sentence Transformer
- **Base Architecture**: Distilled from patembed-large using layers {0,4,8,12,16,20}
- **Parameters**: 117M
- **Number of Layers**: 6
- **Hidden Size**: 1024
- **Embedding Dimension**: 384
- **Max Sequence Length**: 512 tokens
- **Language**: English
- **License**: CC BY-NC-SA 4.0

## Model Description

patembed-small is the variant intended for resource-limited deployment. It keeps the 1024-dimensional hidden size and projects the output to 384-dimensional embeddings.

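To make the shapes concrete, here is a minimal sketch of two figures from Model Details: keeping every fourth layer of a deeper teacher, and projecting a 1024-dimensional pooled vector down to 384 dimensions. The 24-layer teacher depth, the random weights, and the bias-free linear projection are assumptions made for illustration only; the actual distillation procedure and projection head are described in the paper.

```python
import random

HIDDEN_SIZE = 1024     # from Model Details
EMBEDDING_DIM = 384    # from Model Details
TEACHER_LAYERS = 24    # assumption: the depth of patembed-large is not stated here

# Keeping every 4th teacher layer reproduces the listed layer indices.
kept_layers = list(range(0, TEACHER_LAYERS, 4))
print(kept_layers)


def project(vector, weight):
    """Apply a bias-free linear projection: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


rng = random.Random(0)
# Illustrative random weights with shape (EMBEDDING_DIM, HIDDEN_SIZE).
weight = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)]
          for _ in range(EMBEDDING_DIM)]
# Stand-in for a pooled sentence vector produced by the encoder.
pooled = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)]

embedding = project(pooled, weight)
print(len(pooled), "->", len(embedding))
```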

This model is part of the **patembed family**, developed through multi-task learning on 13 training tasks from the PatenTEB benchmark. For detailed information about the training methodology, architecture, and comprehensive evaluation results, please refer to [our paper](https://arxiv.org/abs/2510.22264).

## Usage

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('datalyes/patembed-small')

# Encode patent texts
patent_texts = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
]
embeddings = model.encode(patent_texts)

# Compute cosine similarity between the two embeddings
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```

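For readers unfamiliar with the metric, the following standalone sketch computes the same quantity that `util.cos_sim` returns, written in plain Python on toy 3-dimensional vectors. The vectors are made up; real patembed-small embeddings have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for patent embeddings.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a, twice the length
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

print(f"{cosine_similarity(vec_a, vec_b):.4f}")
print(f"{cosine_similarity(vec_a, vec_c):.4f}")
```

Because the score depends only on direction, scaling an embedding does not change its similarity to other embeddings.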

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('datalyes/patembed-small')
model = AutoModel.from_pretrained('datalyes/patembed-small')

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Tokenize and encode (inputs are truncated to the 512-token maximum)
texts = ["A method for manufacturing semiconductor devices..."]
encoded = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')

# Note: AutoModel returns token states of the hidden size (1024). If the 384-dim
# projection is stored as a separate Sentence Transformers module, it is not applied here.
with torch.no_grad():
    model_output = model(**encoded)
    embeddings = mean_pooling(model_output, encoded['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)
```

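The role of the attention mask in `mean_pooling` is easier to see on a tiny example. The sketch below reimplements the same masked average in plain Python for a single sequence, using made-up 2-dimensional token vectors; the helper name `masked_mean` is introduced here purely for illustration.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors where mask == 1, ignoring padding (mask == 0)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
    # Same guard against division by zero as torch.clamp(..., min=1e-9).
    count = max(sum(attention_mask), 1e-9)
    return [total / count for total in totals]


# Three real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

# The padding vector does not affect the result.
print(masked_mean(tokens, mask))
```

Without the mask, the padding vector would be averaged in and distort the sentence embedding, which is why batches with `padding=True` need masked pooling.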

### Patent Retrieval Example

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('datalyes/patembed-small')

# Query patent
query = "Method for reducing power consumption in mobile devices"

# Candidate patents
candidates = [
    "A power management system for portable electronic devices...",
    "Chemical composition for battery manufacturing...",
    "Method for wireless data transmission in mobile networks...",
]

# Encode and retrieve
query_emb = model.encode(query)
candidate_embs = model.encode(candidates)

# Compute similarities
scores = util.cos_sim(query_emb, candidate_embs)[0]

# Get ranked results
results = [(candidates[i], scores[i].item()) for i in range(len(candidates))]
results.sort(key=lambda x: x[1], reverse=True)

for patent, score in results:
    print(f"Score: {score:.4f} - {patent[:100]}...")
```

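When the candidate pool is large, sorting every score is unnecessary if only the best few hits are needed. The following is a minimal sketch using the standard library's `heapq.nlargest`, with made-up scores standing in for the output of `util.cos_sim`; the helper `top_k` is illustrative and not part of any library.

```python
import heapq


def top_k(candidates, scores, k):
    """Return the k (candidate, score) pairs with the highest scores."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    return heapq.nlargest(k, zip(candidates, scores), key=lambda pair: pair[1])


# Made-up similarity scores, for illustration only.
candidates = ["patent A", "patent B", "patent C", "patent D"]
scores = [0.42, 0.87, 0.15, 0.63]

for patent, score in top_k(candidates, scores, k=2):
    print(f"Score: {score:.4f} - {patent}")
```

Sentence Transformers also ships `util.semantic_search`, which performs a comparable top-k lookup directly on embedding tensors.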

## Intended Use

This model is designed for patent-specific tasks including:
- Patent search and retrieval
- Prior art search
- Patent classification and clustering
- Technology landscape analysis

For detailed training methodology, evaluation protocols, and performance analysis, please refer to [our paper](https://arxiv.org/abs/2510.22264).

## Citation

If you use this model, please cite our paper:

```bibtex
@misc{ayaou2025patentebcomprehensivebenchmarkmodel,
  title={PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding},
  author={Iliass Ayaou and Denis Cavallucci},
  year={2025},
  eprint={2510.22264},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.22264}
}
```

**Paper**: [PatenTEB on arXiv](https://arxiv.org/abs/2510.22264)

## License

This model is released under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** license.

**Key Terms:**
- ✅ You can use, share, and adapt the model
- ✅ You must give appropriate credit
- ❌ You may not use the model for commercial purposes
- ⚠️ If you adapt or build upon this model, you must distribute under the same license

For full license details: https://creativecommons.org/licenses/by-nc-sa/4.0/

## Contact

- **Authors**: Iliass Ayaou, Denis Cavallucci
- **Institution**: ICUBE Laboratory, INSA Strasbourg
- **GitHub**: [iliass-y/patenteb](https://github.com/iliass-y/patenteb)
- **HuggingFace**: [datalyes](https://huggingface.co/datalyes)