Zilence006 committed (verified)

Commit 7bb91c8 · Parent(s): 55bdda9

Update README.md

Files changed (1): README.md (+32 −1)
README.md CHANGED
@@ -6,4 +6,35 @@ tags:
 - video
 - audio
 - multimodal
----
+---
+
+# [Vidi: Large Multimodal Models for Video Understanding and Editing](https://arxiv.org/pdf/2504.15681)
+
+Homepage: [https://bytedance.github.io/vidi-website/](https://bytedance.github.io/vidi-website/)
+
+GitHub: [https://github.com/bytedance/vidi](https://github.com/bytedance/vidi)
+
+Demo: [https://vidi.byteintl.com/](https://vidi.byteintl.com/)
+
+> We introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing (VUE) scenarios. The first release focuses on temporal retrieval (TR), i.e., identifying the time ranges in input videos corresponding to a given text query.
+
+This model is the first Vidi release, targeting temporal retrieval.
+
+The inference and evaluation code is available at [https://github.com/bytedance/vidi](https://github.com/bytedance/vidi).
+
+## Citation
+If you find Vidi useful for your research and applications, please cite it with this BibTeX entry:
+```bibtex
+@article{Vidi2025vidi,
+  title={Vidi: Large Multimodal Models for Video
+         Understanding and Editing},
+  author={Vidi Team and Celong Liu and Chia-Wen Kuo and Dawei Du and
+          Fan Chen and Guang Chen and Jiamin Yuan and Lingxi Zhang and
+          Lu Guo and Lusha Li and Longyin Wen and Qingyu Chen and
+          Rachel Deng and Sijie Zhu and Stuart Siew and Tong Jin and
+          Wei Lu and Wen Zhong and Xiaohui Shen and Xin Gu and Xing Mei and
+          Xueqiong Qu},
+  journal={arXiv preprint arXiv:2504.15681},
+  year={2025}
+}
+```
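
For orientation, below is a minimal sketch of the temporal-retrieval (TR) task the README describes: a video plus a text query in, a list of matching time ranges out. The `TimeRange` type and `retrieve` function are hypothetical placeholders for illustration only, not the actual Vidi API; the real inference and evaluation code lives in the linked GitHub repository.

```python
# Hypothetical sketch of the temporal-retrieval (TR) task, NOT the actual
# Vidi API. See https://github.com/bytedance/vidi for the official code.
from dataclasses import dataclass


@dataclass
class TimeRange:
    """One retrieved segment, in seconds from the start of the video."""
    start_s: float
    end_s: float


def retrieve(video_path: str, query: str) -> list[TimeRange]:
    """Placeholder for a TR call: return the time ranges in the video at
    `video_path` that correspond to the natural-language `query`."""
    raise NotImplementedError("use the official code at github.com/bytedance/vidi")


# Expected shape of a result for a query such as "a dog catches a frisbee":
# [TimeRange(start_s=12.4, end_s=18.9), TimeRange(start_s=73.0, end_s=75.5)]
```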