DriveVGGT: Visual Geometry Transformer for Autonomous Driving
Abstract
Feed-forward reconstruction has recently gained significant attention, with VGGT being a notable example. However, directly applying VGGT to autonomous driving (AD) systems leads to sub-optimal results due to the differing priors between the two settings. In AD systems, several important new priors need to be considered: (i) The overlap between camera views is minimal, as autonomous driving sensor setups are designed to achieve coverage at low cost. (ii) The camera intrinsics and extrinsics are known, which imposes additional constraints on the output and also enables the estimation of absolute scale. (iii) The relative positions of all cameras remain fixed even though the ego vehicle is in motion.
To fully integrate these priors into a feed-forward framework, we propose DriveVGGT, a scale-aware 4D reconstruction framework specifically designed for autonomous driving data. Specifically, we propose a Temporal Video Attention (TVA) module to process multi-camera videos independently, which better leverages the spatiotemporal continuity within each single-camera sequence. Then, we propose a Multi-camera Consistency Attention (MCA) module to conduct window attention with normalized relative pose embeddings, aiming to establish consistency relationships across different cameras while restricting each token to attend only to nearby frames. Finally, we extend the standard VGGT heads by adding an absolute scale head and an ego vehicle pose head. Experiments show that DriveVGGT outperforms VGGT, StreamVGGT, and fastVGGT on autonomous driving data, while extensive ablation studies verify the effectiveness of the proposed designs.
1 Introduction
4D reconstruction is a computer vision task that predicts geometric information from visual sensors [i1, i12, i13]. Compared to other sensors, camera-based reconstruction has been widely researched and deployed as a low-cost solution in various fields, particularly in autonomous driving [li2025think2drive, yang2311llm4drive, jia2021ide, li2023delving] and robotics [i2, i14, fan2025interleave]. Generally, there are two types of reconstruction methods. The first is iterative optimization-based methods, such as [deng2024plgslam, guo2024event]. These methods select specific scenes or objects and iteratively optimize the reconstruction to obtain refined results. However, due to their limited generalization [i4], iterative methods require re-optimization or retraining whenever the scene or object changes. The second is feed-forward methods [i5]. These methods directly output predictions without updating any model parameters. The representative model, VGGT, can simultaneously address four geometry tasks across diverse scenes, which marks a significant breakthrough.
Although feed-forward methods achieve strong generalization, several limitations arise when applying them to autonomous driving scenes [i6, hu2023planning, jia2023hdgt, wu2301policy]. First, to balance field of view (FOV) coverage and cost [i7], the cameras on self-driving vehicles point in very different directions [jia2023driveadapter, wu2022trajectory, jia2024amp], so the captured images have little overlap [jia2024bench2drive, you2024bench2driver, yang2025drivemoe]. It is therefore difficult for a model to match similar features and predict reliable inter-image pose relationships [i8, jia2022multi]. Second, although calibrated relative camera poses are readily available in autonomous driving systems [i9, jia2023think, lu2024activead], they cannot be directly injected into feed-forward methods. Because of the scale difference between the predictions of feed-forward models and real-world relative poses, direct aggregation causes scale ambiguity among geometry tokens [i11]. Moreover, in most previous feed-forward architectures, each image is associated with only one camera pose token [vggt], which means the relative pose cannot be effectively represented.
To sufficiently exploit camera relative poses in a multi-camera system [jia2025drivetransformer, zhu2024flatfusion, jia2023towards], we propose a multi-camera visual geometry transformer that fuses relative poses with the geometry tokens produced by VGGT. The model comprises two components. First, the Temporal Video Attention module performs camera-level geometry aggregation for each camera separately. Owing to the temporal continuity of each camera's video, VGGT can effectively process a single-camera video into geometry tokens. The geometry tokens of each image consist of image pose tokens and depth tokens, which are used to predict the image pose and depth, respectively. However, the image pose tokens only represent the relationship between the current image and the first image of the sequence. Therefore, to establish the pose relationship among all cameras on the vehicle, the Multi-camera Consistency Attention module is proposed to inject relative poses as extra pose tokens for each image. Specifically, we propose a relative pose embedding that normalizes the real-world camera poses and aligns them to the same dimension as the geometry tokens. To enable interaction among images from different cameras, we use window attention to sequentially enhance adjacent multi-camera tokens. The proposed method outperforms other models on the nuScenes dataset [i10], which provides 6 low-overlap cameras on the vehicle. In particular, the proposed method achieves better reconstruction results with lower latency.
In conclusion, our contributions are as follows:
1) We propose DriveVGGT, a feed-forward framework for 4D reconstruction with multi-camera systems in autonomous driving. Compared to VGGT, DriveVGGT fully incorporates the data priors of AD systems and the unique settings of multi-camera rigs. As a result, DriveVGGT achieves faster inference and higher prediction accuracy, enabling more efficient and reliable execution of various autonomous driving tasks.
2) We introduce an efficient two-stage pipeline to deal with multi-camera images. Specifically, we propose a Temporal Video Attention (TVA) module to process multi-camera videos independently, which better leverages the spatiotemporal continuity within each single-camera sequence. We propose a Multi-camera Consistency Attention (MCA) module to conduct window attention with normalized relative pose embeddings, establishing consistency relationships across different cameras while restricting each token to attend only to nearby frames.
3) Extensive experiments on the nuScenes dataset show the superiority of our proposed DriveVGGT, where it outperforms other VGGT-based methods in both inference speed and prediction accuracy.
2 Related Works
2.1 4D Scene Reconstruction
4D scene reconstruction is widely applied in autonomous driving [yang2025resim, yang2025raw2drive, yang2025trajectory] and robotics [chen2024single, tang2022point, miao2025advances]. In recent years, feed-forward reconstruction has been widely researched due to its strong generalization and explicit 3D geometry outputs. [dust3r] proposes the first end-to-end 3D reconstruction pipeline that predicts image poses, intrinsics, and depths simultaneously. [monst3r] extends 3D reconstruction from static to dynamic scenes by additionally predicting dynamic masks to better handle moving objects. Subsequently, [megasam] learns a movement map to exhaustively estimate the motion probability of objects in a scene. To simplify the end-to-end reconstruction pipeline, VGGT [vggt] proposes a transformer-based feed-forward method that decodes various geometric quantities from image tokens. Building on this, [stream] achieves streaming reconstruction with VGGT by storing tokens in a memory cache, significantly reducing inference time through temporal causal attention. To further optimize VGGT, fastVGGT [fastvggt] applies region-based random sampling to reduce inference time, which is especially beneficial when processing large numbers (1000+) of input images. Furthermore, [faster] replaces fully global attention with block-sparse global attention to improve model efficiency.
2.2 Temporal-Spatial Geometry Consistency
Monocular camera geometry consistency mainly focuses on temporal continuity [wang2025diffusion, xue2025human, gao2020using, jia2020sentimem]. [wu2025video] utilizes static point clouds to improve the video generation consistency of a diffusion transformer. [zhou2025stable] adopts an M-in, N-out architecture during training to recover missing images in temporal sequences. Meanwhile, [you2024nvs] proposes a training-free pipeline to consistently generate high-quality novel views. [baisyncammaster] introduces a multi-view synchronization module to control the geometric consistency between two different views. [kuang2024collaborative] uses masked attention to maintain consistency of overlapping regions during generation. In the autonomous driving field, [lu2024seeing] introduces decoupled attention to achieve efficient interaction and thereby preserve temporal-spatial geometry consistency across 6-view cameras.
2.3 Position Geometry Representation
In typical scene reconstruction, estimating 6-DoF camera poses is crucial for achieving accurate predictions [hao2025research]. Regarding the representation of image position, [kong2024eschernet] introduces a novel camera positional encoding that represents both 4-DoF (object-centric) and 6-DoF camera poses. [xu2024se] uses spherical harmonics as positional encodings, which are equivariant and well suited to representing relative poses. [miyatogta] explores an effective transformer structure tailored to the special role of position tokens. [licameras] proposes a novel relative positional encoding that processes images within attention blocks in a manner similar to RoPE [su2024roformer]. In recent reconstruction works, VGGT [vggt] adds extra image pose tokens to each image and decodes them into image extrinsics and intrinsics. However, in VGGT, the first frame's camera pose must be initialized separately as a positional reference for the other frames, which can degrade prediction accuracy when images are fed to the model in different orders. To address this problem, [wang2025pi] introduces a fully permutation-equivariant architecture that eliminates this bias and treats different image orders equally.
3 Proposed Approach
3.1 Overview
We propose DriveVGGT to fully utilize the relative camera poses to improve performance on geometry tasks such as camera pose estimation and depth estimation. The model architecture is illustrated in Fig. 2 and consists of three sub-modules. First, the Temporal Video Attention (TVA) module extracts geometric features from each camera sequence; these features contain sequential pose tokens, which encode the pose relationship with the first frame of each video, and image tokens, which encode geometric features. Then, the Multi-camera Consistency Attention (MCA) module performs multi-camera attention over adjacent images. To overcome the instability caused by low-overlap images, we inject relative poses into the attention process to obtain a unified geometry representation. Finally, the prediction heads decode the above features into relative poses, sequential poses, and depths, respectively.
3.2 Temporal Video Attention
The Temporal Video Attention module is proposed to establish the initial geometry relationship among the images captured by each camera. These images form a streaming video, which feed-forward geometry models (such as VGGT) can readily reconstruct. Specifically, given $N$ images, the simplest form of a feed-forward geometry transformer is

$(t_1, \dots, t_N) = f(I_1, \dots, I_N), \qquad (1)$

where $I_i \in \mathbb{R}^{3 \times H \times W}$ is the $i$-th image with resolution $H \times W$, and $f$ is the transformer that maps these images to tokens $t_i$. Subsequently, a decoder head translates these tokens into practical geometry information:

$(D_i, E_i) = \mathrm{Dec}(t_i), \quad i = 1, \dots, N, \qquad (2)$

where $D_i$ is the predicted depth map, $E_i$ is the image extrinsic, a 6-dimensional vector encoding the rotation and translation of the image in the 3D world, and $t_i$ are the geometry tokens of each image.
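The feed-forward interface of Eqs. (1)-(2) can be summarized with a toy sketch. The module names and sizes below (GeometryTransformer, DepthPoseDecoder, a 128-dim token space) are illustrative placeholders, not the actual VGGT implementation.

```python
# Toy sketch of Eqs. (1)-(2): images -> geometry tokens -> (depth, extrinsic).
import torch
import torch.nn as nn

class GeometryTransformer(nn.Module):
    """Stand-in for f in Eq. (1): maps N images to N sets of geometry tokens."""
    def __init__(self, dim=128, patch=14):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)

    def forward(self, images):                       # images: (N, 3, H, W)
        tokens = self.patchify(images)               # (N, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)   # (N, L, dim)
        return self.encoder(tokens)                  # geometry tokens t_1..t_N

class DepthPoseDecoder(nn.Module):
    """Stand-in for Dec in Eq. (2): decodes tokens to a depth map and a 6-D pose."""
    def __init__(self, dim=128):
        super().__init__()
        self.depth_head = nn.Linear(dim, 1)          # per-patch depth (toy stand-in for DPT)
        self.pose_head = nn.Linear(dim, 6)           # rotation (3) + translation (3)

    def forward(self, tokens):                       # tokens: (N, L, dim)
        depth = self.depth_head(tokens).squeeze(-1)  # (N, L)
        pose = self.pose_head(tokens.mean(dim=1))    # (N, 6)
        return depth, pose

if __name__ == "__main__":
    imgs = torch.randn(4, 3, 280, 518)               # N = 4 images
    depth, pose = DepthPoseDecoder()(GeometryTransformer()(imgs))
    print(depth.shape, pose.shape)                   # (4, 740) and (4, 6)
```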
In the multi-camera setting, unlike the global attention in VGGT, Temporal Video Attention only applies attention among images captured by the same camera. For instance, for $M$ cameras that each capture $N$ images, the TVA module computes

$(t_{1,j}, \dots, t_{N,j}) = f_{\mathrm{TVA}}(I_{1,j}, \dots, I_{N,j}), \quad j = 1, \dots, M, \qquad (3)$

which only aggregates features within each camera. The output tokens of the TVA module are

$t_{i,j} = [\,p^{\mathrm{seq}}_{i,j},\; g_{i,j}\,], \qquad (4)$

where $p^{\mathrm{seq}}_{i,j}$ indicates that the camera pose tokens only represent the sequential pose prediction, i.e., the pose aligned with the first image of each camera's own sequence, and $g_{i,j}$ are the image geometry tokens.
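A minimal sketch of the per-camera processing in Eq. (3), assuming the backbone is any VGGT-style aggregator that maps an image sequence to tokens (for example, the toy GeometryTransformer above); the key point is that attention is run within each camera's sequence only.

```python
# Per-camera temporal attention: the shared backbone is applied to each
# camera's video independently, so no cross-camera mixing happens here.
import torch

def temporal_video_attention(backbone, images):
    """images: (N, M, 3, H, W) -> tokens: (N, M, L, C), one sequence per camera."""
    n_frames, n_cams = images.shape[:2]
    per_camera_tokens = []
    for j in range(n_cams):
        seq = images[:, j]                               # (N, 3, H, W): camera j's video
        per_camera_tokens.append(backbone(seq))          # attention only within this sequence
    return torch.stack(per_camera_tokens, dim=1)         # (N, M, L, C)
```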
3.3 Relative Pose Embedding
Because the geometry outputs of feed-forward visual geometry models have an uncertain scale, it is necessary to pre-process the relative poses of all cameras on the car or robot. First, to alleviate the scale difference between inputs and outputs, we normalize the camera translations across all cameras to zero mean and a standard deviation of 0.1.
Following the encoding scheme of VGGT, we pack the intrinsics and extrinsics of each camera into a 10-D vector

$c_j = [\,\mathbf{q}_j,\; \bar{\mathbf{t}}_j,\; \mathbf{k}_j\,] \in \mathbb{R}^{10}, \qquad (5)$

where $\mathbf{q}_j$ is the rotation quaternion, $\bar{\mathbf{t}}_j$ the normalized translation, and $\mathbf{k}_j$ the intrinsic terms. Since the relative poses of the $M$ cameras on a self-driving vehicle are static over time, only $M$ camera poses need to be processed. We then project $c_j$ to the same dimension as the tokens from the TVA module and treat the result as geometry information encoding the relative pose relationship of all cameras on the vehicle:

$r_j = \mathrm{MLP}(c_j), \quad j = 1, \dots, M. \qquad (6)$
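A sketch of the relative pose embedding, under the assumption that the 10-D vector packs a quaternion, the normalized translation, and three intrinsic terms, and that the projection to the token dimension is a small MLP; the paper only specifies the 10-D packing, the normalization (mean 0, std 0.1), and the projection to the token dimension.

```python
# Relative pose embedding sketch: normalize translations, pack a 10-D pose
# vector per camera, and project it to the token dimension.
import torch
import torch.nn as nn

def normalize_translations(t, target_std=0.1):
    """t: (M, 3) camera translations -> zero mean, std 0.1 across cameras."""
    t = t - t.mean(dim=0, keepdim=True)
    return t / (t.std() + 1e-8) * target_std

class RelativePoseEmbedding(nn.Module):
    def __init__(self, token_dim=128, pose_dim=10):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(pose_dim, token_dim), nn.GELU(), nn.Linear(token_dim, token_dim))

    def forward(self, quat, trans, intrin):
        """quat: (M, 4), trans: (M, 3), intrin: (M, 3) -> (M, 1, token_dim)."""
        trans = normalize_translations(trans)
        pose_vec = torch.cat([quat, trans, intrin], dim=-1)  # (M, 10)
        return self.proj(pose_vec).unsqueeze(1)              # one relative pose token per camera
```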
3.4 Multi-Camera Consistency Attention
The TVA module only applies attention among images from the same camera, which leaves two problems. First, the initial image pose of every camera's video is set to the same reference position, so relative poses must be estimated to recover the camera poses in the global world frame. Second, the scale of each video is misaligned due to the attention isolation between cameras. To overcome these problems, the Multi-camera Consistency Attention (MCA) module is proposed to obtain unified reconstruction results. The MCA module is illustrated in detail in Fig. 3; it also achieves lower computational complexity for long-sequence attention.
3.4.1 Token Initialization
To refine the tokens from the TVA module, a token initialization step is performed before the attention, which attaches the relative pose tokens to the initial TVA tokens. Since the subsequent prediction heads only utilize tokens from 4 selected layers, we only extract and process these selected tokens in the MCA module. For each selected layer, we concatenate the relative pose tokens produced by the relative pose embedding module:

$\hat{t}_{i,j} = [\,r_j,\; p^{\mathrm{seq}}_{i,j},\; g_{i,j}\,], \qquad (7)$

where $i$ indicates the frame index within each video and $j$ indicates the $j$-th camera on the vehicle. As the cameras on the vehicle are fixed, the relative camera pose token of each frame is the same.
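A sketch of the token initialization step, assuming the per-camera relative pose token is simply prepended to every frame's TVA tokens at each selected layer; the exact concatenation position is an assumption.

```python
# Token initialization sketch: reuse the same relative pose token for every
# frame of a camera (the rig is fixed) and concatenate it with the TVA tokens.
import torch

def initialize_tokens(tva_tokens, rel_pose_tokens):
    """
    tva_tokens:      (N, M, L, C)  selected-layer tokens from the TVA module
    rel_pose_tokens: (M, 1, C)     one relative pose token per camera
    returns:         (N, M, L + 1, C)
    """
    n_frames = tva_tokens.shape[0]
    rel = rel_pose_tokens.unsqueeze(0).expand(n_frames, -1, -1, -1)  # (N, M, 1, C)
    return torch.cat([rel, tva_tokens], dim=2)
```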
3.4.2 Window Attention
Unlike stream-based methods, global reconstruction optimization can attend to all time steps, past or future [zhang2025advances]. For long-sequence video reconstruction, however, global attention over all images is redundant and inefficient. Thus, we adopt window attention, applying the attention operation to multi-camera images belonging to 3 adjacent time frames:

$\tilde{t}_{i} = \mathrm{Attn}\big(\hat{t}_{i-1},\; \hat{t}_{i},\; \hat{t}_{i+1}\big), \qquad (8)$

where $\mathrm{Attn}$ denotes the $i$-th global attention over the $(i{-}1)$-th, $i$-th, and $(i{+}1)$-th tokens of all cameras, and $\tilde{t}_{i}$ are the final optimized $i$-th tokens. For $N$ frames per camera sequence, this attention operation is applied $N$ times.
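A sketch of the size-3 window attention in Eq. (8): for each frame, attention runs over the tokens of all cameras in the adjacent frames and only the centre frame's output is kept. The attention block is a generic placeholder standing in for the aggregator layers.

```python
# Window attention sketch: temporal window of 3 frames, all cameras jointly.
import torch
import torch.nn as nn

def window_attention(tokens, attn_block, window=3):
    """tokens: (N, M, L, C) -> (N, M, L, C), attention limited to a temporal window."""
    n_frames, n_cams, n_tok, dim = tokens.shape
    half = window // 2
    outputs = []
    for i in range(n_frames):
        lo, hi = max(0, i - half), min(n_frames, i + half + 1)
        win = tokens[lo:hi].reshape(1, -1, dim)              # adjacent frames, all cameras
        out = attn_block(win).reshape(hi - lo, n_cams, n_tok, dim)
        outputs.append(out[i - lo])                          # keep the centre frame only
    return torch.stack(outputs, dim=0)

if __name__ == "__main__":
    blk = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    x = torch.randn(5, 6, 10, 64)                            # 5 frames, 6 cameras, 10 tokens
    print(window_attention(x, blk).shape)                    # torch.Size([5, 6, 10, 64])
```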
Finally, after the window attention, all tokens consist of 3 parts: relative pose tokens, sequential pose tokens, and image geometry tokens. Since the relative camera poses are time-invariant, the MCA module ultimately outputs $M$ relative poses and $N$ sequential poses. We therefore aggregate the sequential pose tokens as

$\bar{p}^{\mathrm{seq}}_{i} = \frac{1}{M}\sum_{j=1}^{M} \tilde{p}^{\mathrm{seq}}_{i,j}, \qquad (9)$

and the relative pose tokens as

$\bar{r}_{j} = \frac{1}{N}\sum_{i=1}^{N} \tilde{r}_{i,j}. \qquad (10)$
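A sketch of the aggregation in Eqs. (9)-(10), under the assumption that the time-invariant relative pose tokens are averaged over frames and the sequential pose tokens are averaged over cameras; the token layout (index 0 = relative pose token, index 1 = sequential pose token) is hypothetical.

```python
# Pose token aggregation sketch: M relative pose tokens, N sequential pose tokens.
def aggregate_pose_tokens(tokens):
    """tokens: (N, M, L, C) -> rel: (M, C), seq: (N, C)."""
    rel_pose = tokens[:, :, 0].mean(dim=0)   # average over N frames  -> M relative poses
    seq_pose = tokens[:, :, 1].mean(dim=1)   # average over M cameras -> N sequential poses
    return rel_pose, seq_pose
```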
3.5 Prediction Heads
Camera Pose Head: Following the VGGT architecture, we reuse the VGGT camera head to decode the relative poses and sequential poses separately. Specifically, the Relative Pose Head outputs the time-invariant relative camera poses, including the relative extrinsics and the intrinsics, while the Sequential Pose Head outputs the per-frame extrinsics of each camera video. The two results are then composed to obtain the camera pose in the global frame:

$G_{i,j} = E^{\mathrm{seq}}_{i} \otimes E^{\mathrm{rel}}_{j}, \qquad (11)$

where $G_{i,j}$ is the global extrinsic of camera $j$ at frame $i$, $E^{\mathrm{seq}}_{i}$ and $E^{\mathrm{rel}}_{j}$ are decoded from the two camera pose heads, and $\otimes$ indicates matrix multiplication.
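A sketch of the pose composition in Eq. (11) with 4×4 homogeneous matrices; the camera-to-world convention and the multiplication order are assumptions.

```python
# Compose per-frame sequential poses with per-camera relative poses.
import torch

def compose_global_poses(seq_ext, rel_ext):
    """
    seq_ext: (N, 4, 4) per-frame pose from the Sequential Pose Head
    rel_ext: (M, 4, 4) per-camera pose from the Relative Pose Head
    returns: (N, M, 4, 4) global camera pose G_{i,j} = seq_i @ rel_j
    """
    return torch.einsum("nab,mbc->nmac", seq_ext, rel_ext)

if __name__ == "__main__":
    G = compose_global_poses(torch.eye(4).repeat(25, 1, 1), torch.eye(4).repeat(6, 1, 1))
    print(G.shape)  # torch.Size([25, 6, 4, 4])
```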
Depth Head: As in the VGGT architecture, we use a DPT head [chen2021dpt] to decode the geometry tokens of each image into depths, outputting both depth predictions and depth confidence maps. The DPT head uses 4 refinement sub-modules to gradually decode high-resolution dense depths from the spatially compressed geometry tokens.
Scale Head: To transform the normalized geometry into real-world scale, we predict a scale factor by comparing the real-world relative poses with the predicted normalized relative poses:

$s = \frac{1}{M}\sum_{j=1}^{M} \frac{\lVert \mathbf{T}^{\mathrm{real}}_{j} \rVert}{\lVert \hat{\mathbf{T}}_{j} \rVert}, \qquad (12)$

where $\mathbf{T}^{\mathrm{real}}_{j}$ and $\hat{\mathbf{T}}_{j}$ denote the real-world and predicted relative-pose translations, respectively. After multiplying the depths and the translations of the camera extrinsics by $s$, the final world-point predictions are recovered to the real-world scale.
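A sketch of how the scale in Eq. (12) could be formed and applied, assuming it is the ratio of real-world to predicted relative-translation norms averaged over cameras; in DriveVGGT this scale is produced by a dedicated head, so this only illustrates the target computation and its use.

```python
# Scale recovery sketch: compare real vs. predicted relative translations,
# then rescale depths and extrinsic translations to metric units.
def recover_scale(t_real, t_pred, eps=1e-8):
    """t_real, t_pred: (M, 3) relative-pose translations -> scalar scale."""
    return (t_real.norm(dim=-1) / (t_pred.norm(dim=-1) + eps)).mean()

def apply_scale(depth, cam_translation, scale):
    """Rescale normalized depths and extrinsic translations to real-world scale."""
    return depth * scale, cam_translation * scale
```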
3.6 Loss
Following the loss design of VGGT, we use depth and camera pose losses to supervise the DriveVGGT training process. The total loss is

$\mathcal{L} = \lambda_{\mathrm{d}}\,\mathcal{L}_{\mathrm{depth}} + \lambda_{\mathrm{s}}\,\mathcal{L}_{\mathrm{seq}} + \lambda_{\mathrm{r}}\,\mathcal{L}_{\mathrm{rel}}, \qquad (13)$

where $\lambda_{\mathrm{d}}$ is 0.1, and $\lambda_{\mathrm{s}}$ and $\lambda_{\mathrm{r}}$ are set to 1.0.
Similar to VGGT, $\mathcal{L}_{\mathrm{depth}}$ consists of 3 parts: a depth error term, a depth gradient error term, and an uncertainty (confidence) map term:

$\mathcal{L}_{\mathrm{depth}} = \sum_{i} \big\lVert \Sigma_i \odot (\hat{D}_i - D_i) \big\rVert + \big\lVert \Sigma_i \odot (\nabla \hat{D}_i - \nabla D_i) \big\rVert - \alpha \log \Sigma_i, \qquad (14)$

where $\hat{D}_i$ and $D_i$ are the predicted and ground-truth depths, $\Sigma_i$ is the predicted confidence map, and $\nabla$ denotes the spatial gradient.
Compared to VGGT, we decouple the camera pose into two kinds of poses. $\mathcal{L}_{\mathrm{seq}}$ is calculated as

$\mathcal{L}_{\mathrm{seq}} = \sum_{i} \big\lVert \hat{E}^{\mathrm{seq}}_{i} - E^{\mathrm{seq}}_{i} \big\rVert_{\epsilon}, \qquad (15)$

and $\mathcal{L}_{\mathrm{rel}}$ as

$\mathcal{L}_{\mathrm{rel}} = \sum_{j} \big\lVert \hat{E}^{\mathrm{rel}}_{j} - E^{\mathrm{rel}}_{j} \big\rVert_{\epsilon}, \qquad (16)$

where $\lVert \cdot \rVert_{\epsilon}$ denotes the Huber loss.
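A sketch of the training objective in Eqs. (13)-(16), mirroring VGGT's confidence-weighted depth loss (value, gradient, and confidence terms) and Huber losses on the two pose sets; the weights and the exact confidence formulation are assumptions.

```python
# Loss sketch: depth loss (value + gradient + confidence) and two Huber pose losses.
import torch
import torch.nn.functional as F

def depth_loss(d_pred, d_gt, conf, alpha=0.2):
    """d_pred, d_gt, conf: (B, H, W); conf is assumed positive (e.g., via softplus)."""
    l_val = (conf * (d_pred - d_gt).abs()).mean()
    gx = lambda d: d[..., :, 1:] - d[..., :, :-1]      # horizontal gradient
    gy = lambda d: d[..., 1:, :] - d[..., :-1, :]      # vertical gradient
    l_grad = ((conf[..., :, 1:] * (gx(d_pred) - gx(d_gt)).abs()).mean()
              + (conf[..., 1:, :] * (gy(d_pred) - gy(d_gt)).abs()).mean())
    l_conf = -alpha * torch.log(conf).mean()           # confidence regularizer
    return l_val + l_grad + l_conf

def total_loss(d_pred, d_gt, conf, seq_pred, seq_gt, rel_pred, rel_gt,
               w_depth=0.1, w_seq=1.0, w_rel=1.0):
    l_seq = F.huber_loss(seq_pred, seq_gt)             # sequential pose loss, Eq. (15)
    l_rel = F.huber_loss(rel_pred, rel_gt)             # relative pose loss, Eq. (16)
    return w_depth * depth_loss(d_pred, d_gt, conf) + w_seq * l_seq + w_rel * l_rel
```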
Table 1: Camera pose estimation on nuScenes (AUC(30) / AUC(15), higher is better) with 15-, 25-, and 35-frame inputs.

| Method | Relative poses | frame=15 AUC(30) | frame=15 AUC(15) | frame=25 AUC(30) | frame=25 AUC(15) | frame=35 AUC(30) | frame=35 AUC(15) |
|---|---|---|---|---|---|---|---|
| VGGT [vggt] | ✗ | 0.8531 | 0.7689 | 0.7866 | 0.6719 | 0.6871 | 0.5477 |
| StreamVGGT [stream] | ✗ | 0.7005 | 0.5884 | OOM | OOM | OOM | OOM |
| fastVGGT [fastvggt] | ✗ | 0.8246 | 0.7191 | 0.7707 | 0.6435 | 0.6830 | 0.5357 |
| VGGT [vggt] | ✓ | 0.8164 | 0.7195 | 0.7403 | 0.6136 | 0.6445 | 0.5002 |
| fastVGGT [fastvggt] | ✓ | 0.7915 | 0.6764 | 0.7321 | 0.5954 | 0.6477 | 0.4976 |
| DriveVGGT(VGGT) | ✓ | 0.8635 | 0.7706 | 0.8010 | 0.6778 | 0.7200 | 0.5811 |
| DriveVGGT(fastVGGT) | ✓ | 0.8534 | 0.7498 | 0.7844 | 0.6514 | 0.6995 | 0.5510 |
Table 2: Depth estimation on nuScenes (Abs Rel, lower is better; δ<1.25, higher is better) with 15-, 25-, and 35-frame inputs.

| Method | Relative poses | frame=15 Abs Rel | frame=15 δ<1.25 | frame=25 Abs Rel | frame=25 δ<1.25 | frame=35 Abs Rel | frame=35 δ<1.25 |
|---|---|---|---|---|---|---|---|
| VGGT [vggt] | ✗ | 0.3666 | 0.8791 | 0.3654 | 0.8817 | 0.3605 | 0.8858 |
| StreamVGGT [stream] | ✗ | 0.3636 | 0.8811 | OOM | OOM | OOM | OOM |
| fastVGGT [fastvggt] | ✗ | 0.3684 | 0.8782 | 0.3693 | 0.8794 | 0.3660 | 0.8825 |
| VGGT [vggt] | ✓ | 0.3718 | 0.8779 | 0.3700 | 0.8805 | 0.3647 | 0.8844 |
| fastVGGT [fastvggt] | ✓ | 0.3655 | 0.8784 | 0.3691 | 0.8795 | 0.3658 | 0.8826 |
| DriveVGGT(VGGT) | ✓ | 0.3805 | 0.8747 | 0.3705 | 0.8825 | 0.3601 | 0.8892 |
| DriveVGGT(fastVGGT) | ✓ | 0.3655 | 0.8854 | 0.3601 | 0.8894 | 0.3539 | 0.8935 |
4 Experiments
4.1 Datasets
The nuScenes dataset consists of various driving scenes. For each scene, nuScenes records 20 seconds of data with rich multimodal information: 6 cameras, 1 lidar, vehicle ego poses, sensor calibration, etc. In our experiments, we primarily use the images and relative poses of the 6 cameras as model inputs. Following previous works on nuScenes, we use 700 driving scenes for training and 150 for validation. Within each scene, we use the labeled samples recorded at 2 Hz for training and testing.
Meanwhile, it is not practical to directly use sparse lidar points as depth ground truth. To address this weakness of the nuScenes dataset, we apply two steps to generate dense depth maps for training. First, we aggregate multi-frame lidar points to build a detailed point cloud of the entire scene; for labeled dynamic objects, we use their 3D boxes at each time step to aggregate their points consistently. Second, after projecting the points onto the image plane, we apply a depth augmentation algorithm to densify and improve the validity of the depth map. These two steps introduce some noise into the depth ground truth, but they make it dense enough for training. The two-step data enhancement is illustrated in Fig. 4.
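A sketch of the projection step in the depth ground-truth generation, assuming aggregated lidar points already expressed in camera coordinates and a simple nearest-point z-buffer; this is illustrative and not the nuScenes devkit API.

```python
# Project aggregated lidar points into a camera to obtain a sparse depth map,
# keeping the nearest point per pixel; densification would follow this step.
import numpy as np

def project_points_to_depth(points_cam, K, h, w):
    """points_cam: (P, 3) in camera coordinates, K: (3, 3) -> sparse (h, w) depth map."""
    z = points_cam[:, 2]
    valid = z > 0.1                                  # keep points in front of the camera
    uv = (K @ points_cam[valid].T).T                 # homogeneous pixel coordinates
    u = (uv[:, 0] / uv[:, 2]).astype(int)
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    depth = np.zeros((h, w))
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi in zip(u[inside], v[inside], z[valid][inside]):
        if depth[vi, ui] == 0 or zi < depth[vi, ui]: # z-buffer: keep the nearest point
            depth[vi, ui] = zi
    return depth
```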
4.2 Implementation Details
For the model inputs, we downsample the original nuScenes images from 1600×900 to 518×280 and adjust the camera intrinsics accordingly during ground-truth generation. Then, as in VGGT, we apply scale normalization to the depth maps and camera poses to keep the scale consistent, and we additionally use the normalization scale as a training target. We train all models on 8 NVIDIA H200 GPUs and test them on 1 NVIDIA H200 GPU. During training, we first randomly sample 3-10 multi-camera frames (18-60 images) per scene and train for 20 epochs, where each epoch contains 1000 iterations with a learning rate of 2e-4. We then freeze the aggregator and fine-tune for another 5 epochs with a learning rate of 1e-5. For fair comparison, the other models are trained with the same protocol.
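When the images are resized from 1600×900 to 518×280, the intrinsics must be rescaled consistently; the sketch below shows the standard pinhole bookkeeping assumed here.

```python
# Rescale pinhole intrinsics to match a resized image.
import numpy as np

def rescale_intrinsics(K, src_wh=(1600, 900), dst_wh=(518, 280)):
    """K: (3, 3) intrinsic matrix for the source resolution -> rescaled copy."""
    sx, sy = dst_wh[0] / src_wh[0], dst_wh[1] / src_wh[1]
    K = K.copy()
    K[0, 0] *= sx; K[0, 2] *= sx   # fx, cx scale with width
    K[1, 1] *= sy; K[1, 2] *= sy   # fy, cy scale with height
    return K
```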
4.3 Pose Estimation
To compare the pose estimation of the proposed method with other VGGT-based methods, we evaluate VGGT, StreamVGGT, and fastVGGT on nuScenes. To illustrate performance under different numbers of images, we use three input settings: 15 frames (90 images), 25 frames (150 images), and 35 frames (210 images). Meanwhile, we incorporate the relative pose embedding into VGGT and fastVGGT to examine the role of relative poses in these models. For our method, we use two base geometry transformers for the temporal video attention in the TVA module, denoted DriveVGGT (VGGT) and DriveVGGT (fastVGGT). The results are shown in Table 1. DriveVGGT (VGGT) achieves better performance than the other methods, especially in scenes with 210 images. Meanwhile, adding the camera pose embedding to VGGT and fastVGGT degrades their performance. For DriveVGGT, in contrast, this aggregation improves the accuracy of camera pose estimation, which demonstrates that DriveVGGT makes effective use of the relative poses.
4.4 Depth Estimation
The comparison of depth estimation is shown in Table 2. As in the evaluation of camera pose estimation, we test VGGT, StreamVGGT, fastVGGT, and DriveVGGT on the nuScenes dataset. In terms of Abs Rel, DriveVGGT (fastVGGT) achieves the best depth estimation performance in the 35-frame setting, which indicates its ability to process long multi-camera video sequences, while StreamVGGT outperforms the other methods in the 15-frame setting.
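For reference, the two depth metrics reported in Table 2 can be computed as below, assuming the second column is the standard threshold accuracy δ < 1.25 (only Abs Rel is named explicitly in the paper).

```python
# Standard depth metrics: absolute relative error and delta < 1.25 accuracy.
import torch

def depth_metrics(pred, gt, mask):
    """pred, gt: depth tensors; mask: boolean tensor of valid ground-truth pixels."""
    p, g = pred[mask], gt[mask]
    abs_rel = ((p - g).abs() / g).mean()
    delta = (torch.max(p / g, g / p) < 1.25).float().mean()
    return abs_rel.item(), delta.item()
```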
4.5 Inference Time Estimation
The comparison of inference time is shown in Table 3. Overall, the proposed method achieves faster inference than VGGT and fastVGGT; the inference time of DriveVGGT (VGGT) is only about 50% of VGGT's in the 35-frame setting. Meanwhile, DriveVGGT (fastVGGT) is slower than DriveVGGT (VGGT) because of the extra token aggregation algorithm in fastVGGT, which adds overhead when processing fewer images per batch.
Table 3: Inference time (ms) on nuScenes.

| Method | frames=15 | frames=25 | frames=35 |
|---|---|---|---|
| VGGT [vggt] | 2268 | 5241 | 9666 |
| StreamVGGT [stream] | 6916 | OOM | OOM |
| fastVGGT [fastvggt] | 1950 | 3341 | 4949 |
| DriveVGGT(VGGT) | 1836 | 3294 | 4907 |
| DriveVGGT(fastVGGT) | 2390 | 3823 | 5043 |
4.6 Visualization
To qualitatively assess the overall performance of the proposed method, we compare the visualization results of VGGT, fastVGGT, and DriveVGGT in Fig. 5. To generate the final point clouds, we project each depth map to global points using the predicted image extrinsics.
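A sketch of this unprojection, assuming pinhole intrinsics and camera-to-world extrinsics; it is the standard depth-to-point-cloud transform rather than code from the paper.

```python
# Unproject a depth map through the intrinsics, then move points to the world frame.
import torch

def depth_to_world_points(depth, K, cam_to_world):
    """depth: (H, W), K: (3, 3), cam_to_world: (4, 4) -> (H*W, 3) world points."""
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()   # (H, W, 3) pixel coords
    rays = pix @ torch.linalg.inv(K).T                              # back-project to rays
    pts_cam = rays * depth.unsqueeze(-1)                            # points in camera frame
    pts_h = torch.cat([pts_cam, torch.ones(h, w, 1)], dim=-1)       # homogeneous coords
    return (pts_h.reshape(-1, 4) @ cam_to_world.T)[:, :3]
```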
We visualize the reconstruction results and camera pose outputs for 3 typical vehicle motion states in traffic scenes, using 30 × 6 images as model input. In the first scene, all methods achieve good reconstruction results, although the camera poses from fastVGGT exhibit slight misalignment compared to the other methods. In the second scene, DriveVGGT maintains stable pose predictions from the first image to the last, whereas VGGT and fastVGGT exhibit severe degradation, especially for images far from the initial one; this serious pose misalignment leads to ambiguous point cloud outputs. In the final scene, the camera poses of VGGT and fastVGGT are slightly misaligned, whereas DriveVGGT outputs stable relative pose predictions and achieves better reconstruction of surrounding structures such as vegetation.
4.7 Ablation Experiment
To validate the effectiveness of the proposed components, we conduct an ablation study by removing the proposed modules from DriveVGGT; the detailed evaluation is presented in Table 4. The baseline only uses the TVA module to apply attention among the images in each video. The results indicate that the baseline cannot handle a multi-camera system because it lacks a relative pose representation. After adding the relative pose embedding, the model outputs correct pose predictions for the multi-camera system.
Table 4: Ablation study (frame=25).

| Method | AUC(30) | Abs Rel | Time (ms) |
|---|---|---|---|
| baseline (TVA) | 0.039 | 0.3711 | 2052 |
| + rel pose embed | 0.7855 | 0.3707 | 2098 |
| DriveVGGT | 0.8010 | 0.3705 | 3294 |
To fully evaluate the function of window attention, we test 3 window sizes in Table 5. Compared to size 5 and size 7, size 3 maintains a good balance between performance and efficiency.
Table 5: Effect of window size (frame=25).

| Window size | AUC(30) | Abs Rel | Time (ms) |
|---|---|---|---|
| size=3 (Ours) | 0.8010 | 0.3705 | 3294 |
| size=5 | 0.8033 | 0.3741 | 4924 |
| size=7 | 0.8087 | 0.3744 | 7263 |
Table 6: Depth alignment methods (frame=15).

| Method | Abs Rel | δ<1.25 |
|---|---|---|
| least squares method | 0.3805 | 0.8747 |
| scale-based method | 0.3666 | 0.7412 |
To evaluate the effectiveness of the scale head, we compare the depth predictions against the ground truth under two alignment methods: the least squares method and the scale-based method. The results, shown in Table 6, indicate that the predicted scale can transform the depths to real-world scale. We further visualize the real-world-scale point clouds and camera extrinsics in Fig. 6; the rescaled point clouds maintain geometric consistency similar to the normalized ones.
5 Conclusion
In this work, we propose DriveVGGT, a feed-forward reconstruction model specialized for multi-camera geometry prediction. Compared to previous methods, DriveVGGT effectively utilizes relative camera poses to enhance the accuracy of geometric predictions such as camera pose and depth estimation. Comprehensive evaluations on the nuScenes dataset demonstrate superior performance compared to previous feed-forward methods, while maintaining lower computational cost.
Supplementary Material
In Table 3 of the manuscript, the inference time of DriveVGGT (fastVGGT) is longer than that of DriveVGGT (VGGT), which may cause confusion. The explanation is that DriveVGGT processes the images of each camera separately (the function of the Temporal Video Attention module). Thus, the VGGT or fastVGGT backbone only sees 1/6 of the images in each forward pass, i.e., 15, 25, or 35 images rather than 90, 150, or 210. The inference times of VGGT and fastVGGT for different numbers of images are shown in the table below. The results indicate that VGGT is faster than fastVGGT when processing fewer images, which explains why DriveVGGT (fastVGGT) is slower than DriveVGGT (VGGT).
Inference time (ms) of VGGT and fastVGGT for different numbers of input images.

| Method | images=6 | images=30 | images=54 | images=150 |
|---|---|---|---|---|
| VGGT | 95 | 461 | 1027 | 5241 |
| fastVGGT | 537 | 893 | 1277 | 3341 |
We visualize the DriveVGGT pipeline details to better illustrate how the two proposed modules act on each token, and we visualize the DriveVGGT outputs to better illustrate the relationship between the prediction heads and the tokens.
To visualize the scaled global points derived from images, we show the scaled results alongside the raw point clouds from the lidar sensor. For better visualization, we normalize the color of the scaled image points. The scale comparison demonstrates that, with the scale head, the model achieves generally accurate real-world geometry.