Official model checkpoints for TC-CLIP
If you like our project, please give us a star ⭐ on GitHub for the latest updates.
Introduction
We present Temporally Contextualized CLIP (TC-CLIP), a novel video understanding framework that leverages holistic video information within its encoding process.
- Temporal Contextualization (TC): Unlike prior approaches that access only a limited number of tokens, TC enables global interactions by summarizing informative tokens from the entire video into context tokens and leveraging them during feature encoding.
- Video-conditional Prompting (VP): Based on the summarized context tokens from the visual domain, VP generates instance-level textual prompts that compensate for the lack of textual semantics in action recognition datasets.
- Solid performance: TC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully-supervised settings on five video action recognition benchmarks.
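The token-summarization idea behind TC can be sketched as follows. This is a minimal illustration only, not the paper's actual implementation: the shapes, the norm-based saliency score, and the number of context tokens `K` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical shapes: T frames, N patch tokens per frame, D channels.
T, N, D, K = 8, 196, 512, 16  # K = number of context tokens (illustrative)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((T, N, D))  # per-frame patch tokens

# 1) Score each token's "informativeness". As a simple stand-in for the
#    paper's saliency criterion, use the token's L2 norm.
scores = np.linalg.norm(tokens, axis=-1)       # (T, N)

# 2) Select the top-K tokens across the WHOLE video (global, not per-frame),
#    and use them as shared context tokens.
flat = tokens.reshape(T * N, D)
top_idx = np.argsort(scores.reshape(-1))[-K:]  # K most salient token indices
context = flat[top_idx]                        # (K, D) context tokens

# 3) During encoding, each frame attends to its own tokens PLUS the shared
#    context tokens, enabling global temporal interactions.
frame0_with_context = np.concatenate([tokens[0], context], axis=0)  # (N + K, D)
print(frame0_with_context.shape)
```

The key point is that the context tokens are built from the entire video once and then made visible to every frame's encoding step.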
This repository contains all model checkpoints used in our experiments.
Models
We use CLIP ViT-B/16 for all experiments below.
- (LLM) denotes that the model uses LLM-rephrased category names from FROSTER. Note that experiments on the SSv2 dataset do not involve LLM rephrasing.
- (P) denotes that the models are first pretrained on Kinetics-400 and subsequently fine-tuned on each dataset. Otherwise, models are directly fine-tuned from CLIP. See Appendix A in the paper.
Zero-shot action recognition
Few-shot action recognition
| Scripts | HMDB-51 (K=2 / 4 / 8 / 16) | UCF-101 (K=2 / 4 / 8 / 16) | SSv2 (K=2 / 4 / 8 / 16) | Ckpt |
|---|---|---|---|---|
| TC-CLIP | 57.3 / 62.3 / 67.3 / 68.6 | 85.9 / 89.9 / 92.5 / 94.6 | 7.3 / 8.6 / 9.3 / 14.0 | Link |
| TC-CLIP (LLM) | 58.6 / 63.3 / 65.5 / 68.8 | 86.8 / 90.1 / 92.0 / 94.3 | 7.3 / 8.6 / 9.3 / 14.0 | Link |
| TC-CLIP (P) | 65.3 / 68.5 / 71.4 / 73.0 | 94.1 / 95.6 / 96.6 / 97.3 | 8.7 / 10.1 / 12.1 / 15.2 | Link |
Base-to-novel generalization
| Scripts | K-400 (Base / Novel / HM) | HMDB-51 (Base / Novel / HM) | UCF-101 (Base / Novel / HM) | SSv2 (Base / Novel / HM) | Ckpt |
|---|---|---|---|---|---|
| TC-CLIP | 78.9 / 63.6 / 70.4 | 73.3 / 54.1 / 62.2 | 95.5 / 78.0 / 85.9 | 17.5 / 13.4 / 15.2 | Link |
| TC-CLIP (LLM) | 79.1 / 65.4 / 71.6 | 73.3 / 59.1 / 65.5 | 95.4 / 81.6 / 88.0 | 17.5 / 13.4 / 15.2 | Link |
| TC-CLIP (P) | N/A | 79.4 / 58.3 / 67.2 | 97.5 / 84.5 / 90.5 | 19.6 / 15.6 / 17.4 | Link |
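For reference, the HM column in the base-to-novel tables is the harmonic mean of base- and novel-class accuracy, which can be computed as:

```python
def harmonic_mean(base: float, novel: float) -> float:
    """Harmonic mean of base- and novel-class accuracy (the HM column)."""
    return 2 * base * novel / (base + novel)

# TC-CLIP on K-400 base-to-novel: Base 78.9, Novel 63.6 -> HM ~70.4
print(round(harmonic_mean(78.9, 63.6), 1))  # 70.4
```

The harmonic mean penalizes a large gap between base and novel accuracy, so it rewards models that generalize rather than merely fit the base classes.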
Fully-supervised action recognition
| Scripts | K-400 (Top-1) | K-400 (Top-5) | Ckpt |
|---|---|---|---|
| TC-CLIP | 85.2 | 96.9 | Link |
Citation
If you find TC-CLIP useful in your research, please consider citing our paper:
```bibtex
@inproceedings{kim2024tcclip,
  title={Leveraging Temporal Contextualization for Video Action Recognition},
  author={Kim, Minji and Han, Dongyoon and Kim, Taekyung and Han, Bohyung},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```