SparseVideoNav Architecture

SparseVideoNav: Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

Model Details

Model Description

SparseVideoNav is the first work to apply video generation models to real-world beyond-the-view vision-language navigation. It shifts the paradigm from continuous to sparse video generation, enabling longer prediction horizons: by guiding trajectory inference with a generated sparse future spanning a 20-second horizon, it achieves sub-second inference (a 27× speed-up). It also marks the first demonstration of beyond-the-view navigation in challenging night scenes.
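To make the continuous-to-sparse shift concrete, the sketch below compares how many frames a dense generator would have to produce over a 20-second horizon against a sparse schedule of a few keyframes. The frame rate (8 fps) and keyframe count (6) are illustrative assumptions, not the paper's exact schedule.

```python
import numpy as np

HORIZON_S = 20.0  # prediction horizon from the model description
FPS = 8           # assumed frame rate for a continuous generator (illustrative)

def dense_timestamps(horizon_s: float = HORIZON_S, fps: int = FPS) -> np.ndarray:
    """Every frame up to the horizon, as a continuous generator would produce."""
    return np.arange(0.0, horizon_s, 1.0 / fps)

def sparse_timestamps(horizon_s: float = HORIZON_S, n_frames: int = 6) -> np.ndarray:
    """A few evenly spaced future keyframes spanning the same horizon."""
    return np.linspace(0.0, horizon_s, n_frames)

dense = dense_timestamps()
sparse = sparse_timestamps()
print(f"dense frames: {len(dense)}, sparse frames: {len(sparse)}")
print(f"generation workload ratio: {len(dense) / len(sparse):.0f}x")
```

Generating only the sparse keyframes cuts the number of synthesized frames by more than an order of magnitude while still covering the full horizon, which is the intuition behind the reported speed-up.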

  • Developed by: Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li
  • Shared by: The University of Hong Kong & OpenDriveLab
  • Model type: Video Generation-based Model for Vision-Language Navigation
  • Language(s) (NLP): English (Instruction prompts)
  • License: CC BY-NC-SA 4.0
  • Finetuned from model: UMT5-XXL (text encoder) and the Wan2.1 VAE

Model Sources

Uses

Direct Use

The model is designed for generating sparse future video frames based on a current visual observation (video) and a natural language instruction (e.g., "turn right"). It is primarily intended for research in Embodied AI, specifically Vision-Language Navigation (VLN) in real-world environments.

Out-of-Scope Use

The model is a research prototype and is not intended for deployment in safety-critical real-world autonomous driving or robotic navigation systems without further extensive testing, safety validation, and fallback mechanisms.

How to Get Started with the Model

Use the code below to get started with the model using our custom pipeline.

Ensure you have cloned the GitHub repository and installed the requirements.

from omegaconf import OmegaConf
from inference import SVNPipeline

# Load configuration
cfg = OmegaConf.load("config/inference.yaml")
cfg.ckpt_path = "/path/to/models/SparseVideoNav-Models" # Path to your downloaded weights
cfg.inference.device = "cuda:0"

# Initialize pipeline
pipeline = SVNPipeline.from_pretrained(cfg)

# Run inference (Returns np.ndarray (T, H, W, C) uint8)
video = pipeline(video="/path/to/input.mp4", text="turn right") 
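The pipeline returns the generated video as a uint8 NumPy array of shape (T, H, W, C). The snippet below shows how to validate and unpack that output; since running the real pipeline requires the SparseVideoNav repository and weights, it uses a fabricated stand-in array (the 480×832 resolution and 6-frame count are hypothetical placeholders).

```python
import numpy as np

# Stand-in for the pipeline output. The model card documents the return type
# as a (T, H, W, C) uint8 array; shape values here are illustrative only.
video = np.random.randint(0, 256, size=(6, 480, 832, 3), dtype=np.uint8)

# Basic sanity checks on the documented contract.
T, H, W, C = video.shape
assert video.dtype == np.uint8 and C == 3

# Split into individual RGB keyframes, e.g. for saving via PIL or imageio.
frames = [video[t] for t in range(T)]
print(f"{T} frames of {W}x{H} RGB")
```

With the real pipeline, replace the fabricated array with the `video` returned by `pipeline(...)`; the unpacking logic is unchanged.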

BibTeX

@article{zhang2026sparse,
  title={Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation},
  author={Zhang, Hai and Liang, Siqi and Chen, Li and Li, Yuxian and Xu, Yukuan and Zhong, Yichao and Zhang, Fu and Li, Hongyang},
  journal={arXiv preprint arXiv:2602.05827},
  year={2026}
}