Official model checkpoints for TC-CLIP
If you like our project, please give us a star ⭐ on GitHub for the latest updates.
Introduction
We present Temporally Contextualized CLIP (TC-CLIP), a novel video understanding framework that leverages holistic video information within its encoding process.
- Temporal Contextualization (TC): Unlike prior approaches that access only a limited number of tokens, TC enables global interactions by summarizing informative tokens from the entire video into context tokens and leveraging them during feature encoding.
- Video-conditional Prompting (VP): Based on the summarized context tokens from the visual domain, VP generates instance-level textual prompts that compensate for the lack of textual semantics in action recognition datasets.
- Solid performance: TC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully-supervised settings on five video action recognition benchmarks.
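The token-summarization idea behind TC can be sketched as follows. This is a minimal illustration only, not the paper's actual implementation: the shapes, the norm-based saliency score, and the number of context tokens `K` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical shapes: T frames, N patch tokens per frame, D channels.
T, N, D, K = 8, 196, 512, 16  # K = number of context tokens (illustrative)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((T, N, D))  # per-frame patch tokens

# 1) Score each token's "informativeness". As a simple stand-in for the
#    paper's saliency criterion, use the token's L2 norm.
scores = np.linalg.norm(tokens, axis=-1)       # (T, N)

# 2) Select the top-K tokens across the WHOLE video (global, not per-frame),
#    and use them as shared context tokens.
flat = tokens.reshape(T * N, D)
top_idx = np.argsort(scores.reshape(-1))[-K:]  # K most salient token indices
context = flat[top_idx]                        # (K, D) context tokens

# 3) During encoding, each frame attends to its own tokens PLUS the shared
#    context tokens, enabling global temporal interactions.
frame0_with_context = np.concatenate([tokens[0], context], axis=0)  # (N + K, D)
print(frame0_with_context.shape)
```

The key point is that the context tokens are built from the entire video once and then made visible to every frame's encoding step.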
This repository contains all model checkpoints used in our experiments.
Models
We use CLIP ViT-B/16 for all experiments below.
- (LLM) denotes that the model uses LLM-rephrased category names from FROSTER. Note that experiments on the SSv2 dataset do not involve LLM rephrasing.
- (P) denotes that the models are first pretrained on Kinetics-400 and subsequently fine-tuned on each dataset. Otherwise, models are directly fine-tuned from CLIP. See Appendix A in the paper.
Zero-shot action recognition
Few-shot action recognition
| Scripts | HMDB-51 (K=2 / 4 / 8 / 16) | UCF-101 (K=2 / 4 / 8 / 16) | SSv2 (K=2 / 4 / 8 / 16) | Ckpt |
|---|---|---|---|---|
| TC-CLIP | 57.3 / 62.3 / 67.3 / 68.6 | 85.9 / 89.9 / 92.5 / 94.6 | 7.3 / 8.6 / 9.3 / 14.0 | Link |
| TC-CLIP (LLM) | 58.6 / 63.3 / 65.5 / 68.8 | 86.8 / 90.1 / 92.0 / 94.3 | 7.3 / 8.6 / 9.3 / 14.0 | Link |
| TC-CLIP (P) | 65.3 / 68.5 / 71.4 / 73.0 | 94.1 / 95.6 / 96.6 / 97.3 | 8.7 / 10.1 / 12.1 / 15.2 | Link |
Base-to-novel generalization
| Scripts | K-400 (Base / Novel / HM) | HMDB-51 (Base / Novel / HM) | UCF-101 (Base / Novel / HM) | SSv2 (Base / Novel / HM) | Ckpt |
|---|---|---|---|---|---|
| TC-CLIP | 78.9 / 63.6 / 70.4 | 73.3 / 54.1 / 62.2 | 95.5 / 78.0 / 85.9 | 17.5 / 13.4 / 15.2 | Link |
| TC-CLIP (LLM) | 79.1 / 65.4 / 71.6 | 73.3 / 59.1 / 65.5 | 95.4 / 81.6 / 88.0 | 17.5 / 13.4 / 15.2 | Link |
| TC-CLIP (P) | N/A | 79.4 / 58.3 / 67.2 | 97.5 / 84.5 / 90.5 | 19.6 / 15.6 / 17.4 | Link |
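For reference, the HM column in the base-to-novel tables is the harmonic mean of base- and novel-class accuracy, which can be computed as:

```python
def harmonic_mean(base: float, novel: float) -> float:
    """Harmonic mean of base- and novel-class accuracy (the HM column)."""
    return 2 * base * novel / (base + novel)

# TC-CLIP on K-400 base-to-novel: Base 78.9, Novel 63.6 -> HM ~70.4
print(round(harmonic_mean(78.9, 63.6), 1))  # 70.4
```

The harmonic mean penalizes a large gap between base and novel accuracy, so it rewards models that generalize rather than merely fit the base classes.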
Fully-supervised action recognition
| Scripts | K-400 (Top-1) | K-400 (Top-5) | Ckpt |
|---|---|---|---|
| TC-CLIP | 85.2 | 96.9 | Link |
Citation
If you find TC-CLIP useful in your research, please consider citing our paper:
```bibtex
@inproceedings{kim2024tcclip,
  title={Leveraging Temporal Contextualization for Video Action Recognition},
  author={Kim, Minji and Han, Dongyoon and Kim, Taekyung and Han, Bohyung},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```