LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
Abstract
A new long-tail driving dataset with multi-view video, trajectories, and multilingual reasoning traces is introduced to improve few-shot generalization and evaluate multimodal models' instruction-following capabilities.
In real-world domains such as self-driving, generalization to rare scenarios remains a fundamental challenge. To address this, we introduce a new dataset designed for end-to-end driving that focuses on long-tail driving events. We provide multi-view video data, trajectories, high-level instructions, and detailed reasoning traces, facilitating in-context learning and few-shot generalization. The resulting benchmark for multimodal models, such as VLMs and VLAs, goes beyond safety and comfort metrics by evaluating instruction following and the semantic coherence between a model's reasoning and its planned trajectory. The reasoning traces are written in English, Spanish, and Chinese by domain experts from diverse cultural backgrounds, making our dataset a unique resource for studying how different forms of reasoning affect driving competence. Our dataset is available at: https://hf.co/datasets/kit-mrt/kitscenes-longtail
Community
We introduce a long-tail autonomous-driving dataset and benchmark that combines multi-view video, trajectories, high-level instructions, and multilingual human-expert reasoning traces in English, Chinese, and Spanish.
In detail, we:
- measure semantic coherence to check whether a model's generated reasoning actually matches its planned trajectory, instead of rewarding plausible-sounding explanations alone (see the coherence sketch after this list),
- propose the Multi-Maneuver Score (MMS), a lightweight evaluation metric that scores safety, comfort, and instruction following across multiple valid futures, addressing the limitations of single-trajectory L2 evaluation (sketched below),
- show that zero-shot planning in long-tail scenarios is brittle, few-shot prompting improves performance, and raw chain-of-thought (CoT) prompting can still yield low reasoning–action alignment. Adding a simple kinematic bicycle model on top of predicted driving actions improves planning results (a rollout sketch follows below).
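As a rough illustration of the coherence check (an assumption on our part, not the paper's exact formula: the encoder, maneuver heuristic, and thresholds below are placeholders), one can embed the generated reasoning and a templated description of the maneuver implied by the planned trajectory, then compare them in embedding space:

```python
# Illustrative semantic-coherence check; the paper's actual metric may differ.
from sentence_transformers import SentenceTransformer, util

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder choice of encoder

def maneuver_from_trajectory(xy):
    """Map a planned (x, y) trajectory to a coarse maneuver description (toy heuristic,
    assuming x is lateral and y is longitudinal displacement in the ego frame)."""
    dx = xy[-1][0] - xy[0][0]
    dy = xy[-1][1] - xy[0][1]
    if abs(dx) > 2.0:
        return "the vehicle turns " + ("right" if dx > 0 else "left")
    return "the vehicle keeps its lane and " + ("proceeds" if dy > 1.0 else "stops")

def semantic_coherence(reasoning: str, planned_xy) -> float:
    """Cosine similarity between the free-form reasoning and the maneuver implied
    by the planned trajectory; higher means reasoning and action agree."""
    emb = _encoder.encode([reasoning, maneuver_from_trajectory(planned_xy)],
                          convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```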
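A minimal sketch of an MMS-style score, assuming a weighted combination of safety, comfort, and instruction-following terms evaluated against every valid future and keeping the best match (the paper's exact sub-metrics and weights will differ):

```python
import numpy as np

def comfort_score(xy, dt=0.5, max_jerk=2.0):
    """Toy comfort proxy: penalize high jerk along the predicted trajectory."""
    v = np.diff(xy, axis=0) / dt
    a = np.diff(v, axis=0) / dt
    jerk = np.linalg.norm(np.diff(a, axis=0) / dt, axis=-1)
    return float(np.clip(1.0 - jerk.max() / max_jerk, 0.0, 1.0)) if len(jerk) else 1.0

def mms(pred_xy, valid_futures, safety_fn, follows_instruction: bool,
        weights=(0.5, 0.3, 0.2)):
    """Score the prediction against each valid future and keep the best score,
    instead of an L2 error against a single ground-truth trajectory.
    `safety_fn` and the weights are illustrative placeholders."""
    w_safe, w_comf, w_instr = weights
    scores = []
    for ref_xy in valid_futures:
        # Closeness to this particular future gates the weighted quality score.
        closeness = np.exp(-np.linalg.norm(pred_xy - ref_xy, axis=-1).mean())
        quality = (w_safe * safety_fn(pred_xy)
                   + w_comf * comfort_score(pred_xy)
                   + w_instr * float(follows_instruction))
        scores.append(closeness * quality)
    return max(scores)
```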
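The kinematic bicycle model itself is standard; a minimal rollout that turns predicted (acceleration, steering) actions into a physically consistent trajectory looks like this (the wheelbase and timestep values are assumptions, not values from the paper):

```python
import numpy as np

def rollout_bicycle(actions, x=0.0, y=0.0, yaw=0.0, v=0.0, wheelbase=2.7, dt=0.5):
    """Standard kinematic bicycle rollout at the rear axle.
    actions: iterable of (acceleration [m/s^2], steering angle [rad]) per step."""
    traj = [(x, y)]
    for accel, steer in actions:
        x += v * np.cos(yaw) * dt
        y += v * np.sin(yaw) * dt
        yaw += v / wheelbase * np.tan(steer) * dt
        v = max(0.0, v + accel * dt)  # no reversing in this simple sketch
        traj.append((x, y))
    return np.array(traj)
```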
Our dataset is available at: https://hf.co/datasets/kit-mrt/kitscenes-longtail
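A minimal loading sketch with the Hugging Face `datasets` library (the split and field names are assumptions; please check the dataset card):

```python
from datasets import load_dataset

# Dataset ID from the link above; "train" split and record layout are assumptions.
ds = load_dataset("kit-mrt/kitscenes-longtail", split="train")
sample = ds[0]
print(sample.keys())  # e.g. camera views, trajectory, instruction, reasoning traces
```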
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SteerVLA: Steering Vision-Language-Action Models in Long-Tail Driving Scenarios (2026)
- DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving (2026)
- HERMES: A Holistic End-to-End Risk-Aware Multimodal Embodied System with Vision-Language Models for Long-Tail Autonomous Driving (2026)
- StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving (2026)
- WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning (2026)
- Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving (2026)
- VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events (2026)