BERT base for SMILES

This is a bidirectional transformer pretrained on SMILES (simplified molecular-input line-entry system) strings.

Example: Amoxicillin

O=C([C@@H](c1ccc(cc1)O)N)N[C@@H]1C(=O)N2[C@@H]1SC([C@@H]2C(=O)O)(C)C

Two training objectives were used:

  1. masked language modeling
  2. molecular-formula validity prediction
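
To make the first objective concrete, here is a minimal sketch of how input masking for masked language modeling can work. The card does not state the masking rate or the tokenization scheme, so the 15% rate (the default in the original BERT recipe), the character-level split, and the helper name mask_tokens are all assumptions made for illustration. The sketch also skips BERT's random-replacement and keep-as-is variants and simply swaps each selected token for [MASK].

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Hypothetical helper: hide a fraction of tokens and remember the originals.

    Returns the masked sequence and a {position: original_token} dict that a
    model would be trained to recover.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token, never more than the sequence length.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels

# Amoxicillin, split per character purely for illustration (assumption:
# the real tokenizer may group characters differently).
smiles = "O=C([C@@H](c1ccc(cc1)O)N)N[C@@H]1C(=O)N2[C@@H]1SC([C@@H]2C(=O)O)(C)C"
tokens = list(smiles)
masked, labels = mask_tokens(tokens, seed=0)
```

During pretraining, the model sees the masked sequence and is scored on how well it predicts the tokens stored in labels.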

Intended uses

This model is intended primarily for fine-tuning on the following tasks:

  • molecule classification
  • molecule-to-gene-expression mapping
  • cell targeting
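
For the molecule classification case, fine-tuning usually means adding a small task head on top of the encoder. The sketch below is one possible head, not the author's setup: the class name MoleculeClassifier, the dropout value, the two-class output, and the hidden size of 768 (the usual BERT-base width) are assumptions. In practice the width would be read from the loaded model's config.hidden_size.

```python
import torch
from torch import nn

class MoleculeClassifier(nn.Module):
    """Hypothetical head: maps one embedding per molecule to class logits."""

    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden_size, num_labels)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, hidden_size) -> logits: (batch, num_labels)
        return self.linear(self.dropout(embeddings))

# Assumed sizes: 768-dimensional embeddings, binary classification.
head = MoleculeClassifier(hidden_size=768, num_labels=2)
head.eval()  # disable dropout for this shape check

# Random stand-in for a batch of 4 molecule embeddings.
logits = head(torch.randn(4, 768))
```

The head would be trained jointly with the encoder, or with the encoder frozen, using a standard loss such as cross-entropy on the logits.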

How to use in your code

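from transformers import BertTokenizerFast, BertModel

# Load the pretrained tokenizer and encoder from the Hugging Face Hub
checkpoint = 'unikei/bert-base-smiles'
tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
model = BertModel.from_pretrained(checkpoint)

# Tokenize a SMILES string (amoxicillin) and run a forward pass
example = 'O=C([C@@H](c1ccc(cc1)O)N)N[C@@H]1C(=O)N2[C@@H]1SC([C@@H]2C(=O)O)(C)C'
tokens = tokenizer(example, return_tensors='pt')
predictions = model(**tokens)  # predictions.last_hidden_state holds one vector per token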
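
The model returns one vector per token, while downstream tasks often need a single vector per molecule. A common way to get one is mask-aware mean pooling. The sketch below is one option rather than a method prescribed by this card, and the function name mean_pool is hypothetical. It is demonstrated on a tiny hand-made tensor so it runs without downloading the checkpoint.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padded positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

# Toy input: one sequence, three positions, the last one is padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)  # the padded third position is ignored
```

With the variables from the snippet above, mean_pool(predictions.last_hidden_state, tokens['attention_mask']) would give one embedding per input SMILES string.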

Research

  • Jouary et al. (2025) Bridging scales between chemical space and behavioral phenotype:

    A cross-modal mapping between behavior and molecular structure, derived using the unikei/bert-base-smiles model, effectively distinguished between distinct neurotransmitter classes, such as dopaminergic/serotonergic ligands, purines, and metabotropic glutamate ligands.

Model size: 0.1B params (Safetensors, tensor type F32)