Papers
arxiv:2604.08121

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Published on Apr 9 · Submitted by GONG JIA on Apr 14
Abstract

Uni-ViGU takes a generation-centric approach to unified multimodal video understanding and generation: it extends a diffusion-based video generator as the foundation, combining unified flow matching with a bidirectional training mechanism.

AI-generated summary

Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
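The core idea of the unified flow method is that one shared process performs continuous flow matching on video latents and discrete flow matching on text tokens. A minimal NumPy sketch of that joint objective is below; all function names, shapes, and the mask-based discrete variant are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def continuous_fm_loss(model_v, x1, t, rng):
    """Continuous flow matching for video latents: regress the velocity
    (x1 - x0) along the straight path x_t = (1 - t) * x0 + t * x1."""
    x0 = rng.standard_normal(x1.shape)           # noise endpoint
    t_ = t.reshape(-1, *([1] * (x1.ndim - 1)))   # broadcast t over latent dims
    xt = (1 - t_) * x0 + t_ * x1
    pred_v = model_v(xt, t)
    return float(np.mean((pred_v - (x1 - x0)) ** 2))

def discrete_fm_loss(model_t, tokens, t, mask_id, rng):
    """Discrete flow matching for text (mask-based variant, an assumption):
    corrupt roughly a (1 - t) fraction of tokens to mask_id, then train the
    model to predict the original tokens at the masked positions."""
    keep = rng.random(tokens.shape) < t[:, None]
    corrupted = np.where(keep, tokens, mask_id)
    logits = model_t(corrupted, t)               # (batch, seq, vocab)
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, tokens[..., None], axis=-1)[..., 0]
    masked = (~keep).astype(float)
    return float((nll * masked).sum() / max(masked.sum(), 1.0))

def unified_flow_loss(model_v, model_t, video_latents, text_tokens,
                      mask_id, rng, text_weight=1.0):
    """A single shared timestep t drives both modalities, so video latents
    and text tokens are denoised jointly within one process."""
    t = rng.random(video_latents.shape[0])
    return (continuous_fm_loss(model_v, video_latents, t, rng)
            + text_weight * discrete_fm_loss(model_t, text_tokens, t,
                                             mask_id, rng))
```

In this sketch, sharing the timestep `t` across the continuous and discrete branches is what makes the two objectives a single generative process rather than two independent losses; the paper's actual coupling and loss weighting may differ.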

Community

Paper submitter

Here's an interesting observation that motivated this work: video generation is computationally much more expensive than video understanding. So why do most unified multimodal models start with understanding-focused architectures (like MLLMs) and then try to bolt on generation capabilities? The authors argue we should flip this around — start with a powerful video generator and extend it to handle understanding tasks. It's a bit like saying "if you can write a novel, you probably understand language pretty well already." Project page: https://fr0zencrane.github.io/uni-vigu-page/


Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/uni-vigu-towards-unified-video-generation-and-understanding-via-a-diffusion-based-video-generator-5526-33b8d679
Covers the executive summary, detailed methodology, and practical applications.



Get this paper in your agent:

hf papers read 2604.08121

Don't have the latest CLI? Install it with:

curl -LsSf https://hf.co/cli/install.sh | bash
