---
license: apache-2.0
pipeline_tag: image-to-3d
tags:
- depth-estimation
- computer-vision
- monocular-depth
- multi-view-geometry
- pose-estimation
---

# Depth Anything 3: DA3-BASE

<div align="center">

[![Project Page](https://img.shields.io/badge/Project_Page-Depth_Anything_3-green)](https://depth-anything-3.github.io)
[![Paper](https://img.shields.io/badge/arXiv-Depth_Anything_3-red)](https://arxiv.org/abs/2511.10647)
[![Demo](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue)](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)

</div>

## Abstract

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.

## Model Description

DA3-Base is a compact foundation model for multi-view depth estimation and camera pose estimation, built on a plain transformer backbone with a unified depth-ray representation.

| Property | Value |
|----------|-------|
| **Model Series** | Any-view Model |
| **Parameters** | 0.12B |
| **License** | Apache 2.0 |



## Capabilities

- ✅ Relative Depth
- ✅ Pose Estimation
- ✅ Pose Conditioning

## Quick Start

### Installation

```bash
git clone https://github.com/ByteDance-Seed/depth-anything-3
cd depth-anything-3
pip install -e .
```

### Basic Example

```python
import torch
from depth_anything_3.api import DepthAnything3

# Load model from Hugging Face Hub
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DepthAnything3.from_pretrained("depth-anything/da3-base")
model = model.to(device=device)

# Run inference on images
images = ["image1.jpg", "image2.jpg"]  # List of image paths, PIL Images, or numpy arrays
prediction = model.inference(
    images,
    export_dir="output",
    export_format="glb"  # Options: glb, npz, ply, mini_npz, gs_ply, gs_video
)

# Access results
print(prediction.depth.shape)        # Depth maps: [N, H, W] float32
print(prediction.conf.shape)         # Confidence maps: [N, H, W] float32
print(prediction.extrinsics.shape)   # Camera poses (w2c): [N, 3, 4] float32
print(prediction.intrinsics.shape)   # Camera intrinsics: [N, 3, 3] float32
```
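
The shapes listed above are enough to lift the per-view depth maps into a shared world frame. The snippet below is a minimal, illustrative unprojection using NumPy; it is not part of the `DepthAnything3` API, and it assumes the outputs are (or have been converted to) NumPy arrays and that `intrinsics` uses the usual pixel convention.

```python
import numpy as np

def depth_to_world_points(depth, K, w2c):
    """Unproject an [H, W] depth map to world-space points using the camera
    intrinsics K ([3, 3]) and the world-to-camera extrinsic [R | t] ([3, 4])."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float32),
                       np.arange(H, dtype=np.float32))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # [H, W, 3] homogeneous pixels
    cam = (pix @ np.linalg.inv(K).T) * depth[..., None]   # X_cam = depth * K^-1 [u, v, 1]^T
    R, t = w2c[:, :3], w2c[:, 3]
    return (cam - t) @ R                                  # X_world = R^T (X_cam - t)

# Fuse all views into one point cloud (convert torch tensors with .cpu().numpy() first if needed):
# points = np.concatenate([
#     depth_to_world_points(prediction.depth[i],
#                           prediction.intrinsics[i],
#                           prediction.extrinsics[i]).reshape(-1, 3)
#     for i in range(prediction.depth.shape[0])
# ])
```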

### Command Line Interface

```bash
# Process images with auto mode
da3 auto path/to/images \
    --export-format glb \
    --export-dir output \
    --model-dir depth-anything/da3-base

# Use backend for faster repeated inference
da3 backend --model-dir depth-anything/da3-base
da3 auto path/to/images --export-format glb --use-backend
```

## Model Details

- **Developed by:** ByteDance Seed Team
- **Model Type:** Vision Transformer for Visual Geometry
- **Architecture:** Plain transformer with unified depth-ray representation
- **Training Data:** Public academic datasets only

### Key Insights

💎 A **single plain transformer** (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization.

✨ A singular **depth-ray representation** obviates the need for complex multi-task learning.
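
To make the second point concrete, the toy snippet below (illustrative only, not the model's internal parameterization) shows why depth plus rays is a sufficient target: a per-pixel ray scaled by a predicted depth already pins down a 3D point, so consistent depth-ray predictions across views determine both the geometry and the cameras.

```python
import numpy as np

def point_on_ray(origin, direction, depth):
    """A 3D point is fully determined by a camera ray and a depth along it.
    (Whether depth is measured along the unit ray or along the optical axis
    is a convention detail not specified here.)"""
    d = direction / np.linalg.norm(direction)   # unit ray direction for this pixel
    return origin + depth * d                   # world-space point

# Hypothetical values for a single pixel of a single camera
p = point_on_ray(origin=np.zeros(3), direction=np.array([0.1, -0.2, 1.0]), depth=2.5)
```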

## Performance

πŸ† Depth Anything 3 significantly outperforms:
- **Depth Anything 2** for monocular depth estimation
- **VGGT** for multi-view depth estimation and pose estimation

For detailed benchmarks, please refer to our [paper](https://arxiv.org/abs/2511.10647).

## Limitations

- The model is trained only on public academic datasets and may underperform on certain domain-specific images
- Performance may vary depending on image quality, lighting conditions, and scene complexity


## Citation

If you find Depth Anything 3 useful in your research or projects, please cite:

```bibtex
@article{depthanything3,
  title={Depth Anything 3: Recovering the visual space from any views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:2511.10647},
  year={2025}
}
```

## Links

- 🏠 [Project Page](https://depth-anything-3.github.io)
- 📄 [Paper](https://arxiv.org/abs/2511.10647)
- 💻 [GitHub Repository](https://github.com/ByteDance-Seed/depth-anything-3)
- 🤗 [Hugging Face Demo](https://huggingface.co/spaces/depth-anything/Depth-Anything-3)
- 📚 [Documentation](https://github.com/ByteDance-Seed/depth-anything-3#-useful-documentation)

## Authors

[Haotong Lin](https://haotongl.github.io/) · [Sili Chen](https://github.com/SiliChen321) · [Junhao Liew](https://liewjunhao.github.io/) · [Donny Y. Chen](https://donydchen.github.io) · [Zhenyu Li](https://zhyever.github.io/) · [Guang Shi](https://scholar.google.com/citations?user=MjXxWbUAAAAJ&hl=en) · [Jiashi Feng](https://scholar.google.com.sg/citations?user=Q8iay0gAAAAJ&hl=en) · [Bingyi Kang](https://bingykang.github.io/)