CLIPCLAP: Unified Text + Image + Audio Embeddings

CLIPCLAP is a unified multimodal embedding model that maps text, images, and audio into a shared 512-dimensional vector space. It combines OpenAI's CLIP (text + image) with LAION's CLAP (audio) through a trained linear projection.

Built by antflydb for use with Termite, a standalone ML inference service for embeddings, chunking, and reranking.

Architecture

Text  ──→ CLIP text encoder  ──→ text_projection  ──→ 512-dim (CLIP space)
Image ──→ CLIP visual encoder ──→ visual_projection ──→ 512-dim (CLIP space)
Audio ──→ CLAP audio encoder  ──→ audio_projection  ──→ 512-dim (CLIP space)
  • Text & Image: Standard CLIP ViT-B/32 encoders and projections (unchanged from openai/clip-vit-base-patch32).
  • Audio: CLAP HTSAT audio encoder from laion/larger_clap_music_and_speech. The audio projection combines CLAP's native audio projection (1024→512) with a trained 512→512 linear layer that maps CLAP audio space into CLIP space.

All three modalities produce 512-dimensional L2-normalized embeddings that are directly comparable via cosine similarity.
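
As a rough illustration of how the shared space is used (a minimal sketch with placeholder vectors, independent of any runtime): because every embedding is unit-length, cross-modal similarity reduces to a dot product.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # All CLIPCLAP embeddings are L2-normalized, so cosine similarity
    # reduces to a plain dot product.
    return float(np.dot(a, b))

# Placeholder 512-dim vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
text_emb, image_emb, audio_emb = (
    v / np.linalg.norm(v) for v in rng.standard_normal((3, 512))
)

print(cosine_similarity(text_emb, image_emb))   # text <-> image
print(cosine_similarity(text_emb, audio_emb))   # text <-> audio
print(cosine_similarity(image_emb, audio_emb))  # image <-> audio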

Intended Uses

  • Multimodal search (text↔image↔audio)
  • Building unified media indexes with Antfly
  • Cross-modal retrieval (find images from audio queries, audio from text, etc.)
  • Audio-visual content discovery

How to Use with Termite

# Pull and run the model
termite pull clipclap
termite run

# Embed text
curl -X POST http://localhost:8082/embed \
  -H "Content-Type: application/json" \
  -d '{
    "model": "clipclap",
    "input": [
      {"type": "text", "text": "a cat sitting on a windowsill"},
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
      {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}}
    ]
  }'
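
The same request can be made programmatically. The sketch below uses only the Python standard library plus NumPy; the response field names ("data", "embedding") are an assumption, so adjust them to Termite's actual response schema.

import json
import urllib.request
import numpy as np

payload = {
    "model": "clipclap",
    "input": [
        {"type": "text", "text": "a cat sitting on a windowsill"},
        {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}},
    ],
}

req = urllib.request.Request(
    "http://localhost:8082/embed",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# Assumed response shape: {"data": [{"embedding": [...]}, ...]} -- verify against Termite.
text_vec, audio_vec = (
    np.asarray(item["embedding"], dtype=np.float32) for item in body["data"]
)

# Embeddings are L2-normalized, so cosine similarity is a dot product.
print(float(text_vec @ audio_vec))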

Training Details

Audio Projection

The audio projection layer bridges CLAP and CLIP embedding spaces. Training procedure:

  1. Load audio-caption pairs from OpenSound/AudioCaps
  2. Encode audio through CLAP: audio encoder → audio_projection → L2 normalize
  3. Encode captions through CLIP: text encoder → text_projection → L2 normalize
  4. Train a 512→512 linear projection (CLAP audio → CLIP text space) using a CLIP-style contrastive loss (InfoNCE)

The contrastive loss pulls matching audio-text pairs together and pushes non-matching pairs apart within each batch, preserving content discrimination.
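
A minimal PyTorch sketch of this training step is shown below. It assumes precomputed, L2-normalized CLAP audio embeddings and CLIP text embeddings for paired samples (here random placeholders), and follows the layer size, loss, and hyperparameters described in this card; it is an illustration, not the exact training script.

import torch
import torch.nn.functional as F

# Placeholder stand-ins for precomputed, L2-normalized embeddings of paired samples:
# clap_audio: (N, 512) CLAP audio embeddings, clip_text: (N, 512) CLIP caption embeddings.
clap_audio = F.normalize(torch.randn(5000, 512), dim=-1)
clip_text = F.normalize(torch.randn(5000, 512), dim=-1)

projection = torch.nn.Linear(512, 512)   # CLAP audio space -> CLIP space
optimizer = torch.optim.Adam(projection.parameters(), lr=1e-3)
temperature = 0.07

loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(clap_audio, clip_text),
    batch_size=256, shuffle=True,
)

for epoch in range(20):
    for audio_batch, text_batch in loader:
        projected = F.normalize(projection(audio_batch), dim=-1)
        # In-batch similarity matrix; matching pairs sit on the diagonal.
        logits = projected @ text_batch.T / temperature
        targets = torch.arange(logits.size(0))
        # Symmetric InfoNCE: audio->text and text->audio cross-entropy.
        loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()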

Hyperparameters

Parameter          Value
Training dataset   OpenSound/AudioCaps
Samples            5000 audio-caption pairs
Epochs             20
Batch size         256
Learning rate      1e-3
Optimizer          Adam
Loss               Symmetric InfoNCE (temperature = 0.07)
Train/val split    90/10

Source Models

  • openai/clip-vit-base-patch32: CLIP text and visual encoders and projections
  • laion/larger_clap_music_and_speech: CLAP HTSAT audio encoder

ONNX Files

File                    Description                                Size
text_model.onnx         CLIP text encoder                          ~254 MB
visual_model.onnx       CLIP visual encoder                        ~330 MB
text_projection.onnx    CLIP text projection (512→512)             ~4 KB
visual_projection.onnx  CLIP visual projection (768→512)           ~6 KB
audio_model.onnx        CLAP HTSAT audio encoder                   ~590 MB
audio_projection.onnx   Combined CLAP→CLIP projection (1024→512)   ~8 KB

Additional files: clip_config.json, tokenizer.json, preprocessor_config.json, projection_training_metadata.json.
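
The exported graphs can be loaded directly with onnxruntime. The input and output tensor names are not documented in this card, so the sketch below simply inspects each graph before you wire up an inference pipeline.

import onnxruntime as ort

# List the input/output tensor names and shapes of each exported graph.
# File names match the table above; verify tensor names before composing
# encoder -> projection -> L2 normalize for a given modality.
for path in ["text_model.onnx", "text_projection.onnx",
             "visual_model.onnx", "visual_projection.onnx",
             "audio_model.onnx", "audio_projection.onnx"]:
    sess = ort.InferenceSession(path)
    inputs = [(i.name, i.shape) for i in sess.get_inputs()]
    outputs = [(o.name, o.shape) for o in sess.get_outputs()]
    print(path, "inputs:", inputs, "outputs:", outputs)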

Limitations

  • Audio duration: Audio is truncated to ~10 seconds (inherited from CLAP)
  • Language: Primarily English text support
  • Audio-visual alignment: The projection is trained via caption similarity (audio↔text↔image), not direct audio-image pairs. Audio-to-image retrieval may be less precise than text-to-image.
  • CLIP limitations: Inherits CLIP's weaknesses in fine-grained visual classification, object counting, and abstract concepts
  • Training data: The audio projection was trained on AudioCaps, which covers common environmental sounds; performance may degrade on niche audio domains

Citation

If you use CLIPCLAP, please cite the underlying models:

@inproceedings{radford2021clip,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
  booktitle={ICML},
  year={2021}
}

@inproceedings{wu2023clap,
  title={Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  author={Wu, Yusong and Chen, Ke and Zhang, Tianyu and others},
  booktitle={ICASSP},
  year={2023}
}