Papers
arxiv:2604.08120

Small Vision-Language Models are Smart Compressors for Long Video Understanding

Published on Apr 9
· Submitted by Junjie Fei on Apr 10

Abstract

Adapting Multimodal Large Language Models (MLLMs) to hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework that compresses long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free O(1) dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro; scaling to 2048 frames raises this to 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, showing that true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.

AI-generated summary

Tempo is an efficient framework that compresses long videos for multimodal understanding by using a small vision-language model for temporal compression and adaptive token allocation to maintain intent-aligned representations within strict budgets.
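To make the budget-routing idea concrete, here is a minimal, hypothetical sketch of budget-constrained token allocation in the spirit of ATA. All names and the proportional-allocation rule are illustrative assumptions, not the authors' implementation: it takes per-segment query-relevance scores (such as an SVLM's zero-shot prior would provide), floors every segment at a minimal temporal anchor, and spends the remaining budget proportionally on query-critical segments up to a per-segment cap.

```python
def allocate_tokens(relevance, total_budget, min_tokens=1, max_tokens=16):
    """Distribute a fixed visual-token budget across video segments.

    relevance: per-segment query-relevance scores in [0, 1].
    Every segment keeps at least `min_tokens` as a temporal anchor so the
    global storyline survives; the rest of the budget goes to the most
    query-relevant segments, capped at `max_tokens` per segment.
    """
    n = len(relevance)
    # Floor: one minimal anchor per segment, regardless of relevance.
    alloc = [min_tokens] * n
    remaining = total_budget - min_tokens * n
    if remaining <= 0:
        return alloc
    total_rel = sum(relevance) or 1.0
    # Spend the leftover budget proportionally to relevance, with a cap.
    for i, r in enumerate(relevance):
        extra = int(remaining * r / total_rel)
        alloc[i] = min(min_tokens + extra, max_tokens)
    return alloc

# Two highly relevant segments get dense bandwidth; the rest stay near
# the anchor floor.
print(allocate_tokens([0.9, 0.1, 0.05, 0.8], total_budget=24))
```

Because the per-segment decision is a single proportional lookup with no optimization loop, a rule of this shape stays training-free and constant-time per segment, matching the O(1) routing behavior described above.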

Community

Paper author · Paper submitter

🔥 Tempo: Small Vision-Language Models are Smart Compressors for Long Video Understanding

How do we make MLLMs understand hour-long videos without saturating context windows? Tempo uses an SVLM to actively filter and compress videos via query-aware cross-modal distillation in a single forward pass!

๐Ÿ† SOTA Performance: Outperforms other long video MLLMs on the extreme-long LVBench (52.3 at 8K budget).

Everything is open-sourced! Try it out:


Get this paper in your agent:

hf papers read 2604.08120
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1
Datasets citing this paper: 0
Spaces citing this paper: 1
Collections including this paper: 0