LVT: Large-Scale Scene Reconstruction via Local View Transformers

Accepted at SIGGRAPH Asia 2025

*Denotes Equal Contribution

¹Google       ²Northeastern University



Abstract

Large transformer models are proving to be a powerful tool for 3D vision and novel view synthesis. However, the standard Transformer's well-known quadratic complexity makes it difficult to scale these methods to large scenes. To address this challenge, we propose the Local View Transformer (LVT), a large-scale scene reconstruction and novel view synthesis architecture that circumvents the need for the quadratic attention operation. Motivated by the insight that spatially nearby views provide more useful signal about the local scene composition than distant views, our model processes all information in a local neighborhood around each view. To attend to tokens in nearby views, we leverage a novel positional encoding that conditions on the relative geometric transformation between the query and nearby views. We decode the output of our model into a 3D Gaussian Splat scene representation that includes view dependence for both color and opacity. Taken together, the Local View Transformer enables reconstruction of arbitrarily large, high-resolution scenes in a single forward pass.


Overview of the Local View Transformer: Instead of the quadratic self-attention operation typically used in transformers, LVT processes information within local neighborhoods relative to each view, based on the insight that spatially nearby views are generally more informative about the local scene structure than distant ones. Cascading these LVT blocks increases the effective receptive field. Our model takes a set of input views representing the entire scene and, in a single feed-forward pass, outputs a Gaussian splat scene representation that can then be rendered from arbitrary target cameras. LVT scales linearly, rather than quadratically, with the number of input images, and generalizes to varied sequence lengths and out-of-distribution camera trajectories.
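The scaling claim can be illustrated with a back-of-the-envelope count of attention interactions. This is a minimal sketch for intuition only; the tokens-per-view count T and window size w below are hypothetical values, not numbers from the paper.

```python
# Illustrative comparison of attention cost: full self-attention over all views
# vs. LVT-style local attention over a window of w neighbor views per query view.

def full_attention_pairs(num_views: int, tokens_per_view: int) -> int:
    """Every token attends to every token across all views: O((N*T)^2)."""
    total_tokens = num_views * tokens_per_view
    return total_tokens * total_tokens

def local_attention_pairs(num_views: int, tokens_per_view: int, window: int) -> int:
    """Each view's tokens attend only to tokens from w neighbor views: O(N * w * T^2)."""
    return num_views * tokens_per_view * (window * tokens_per_view)

if __name__ == "__main__":
    T, w = 256, 3  # hypothetical tokens per view and neighbor-window size
    for n in (16, 64, 256):
        print(n, full_attention_pairs(n, T), local_attention_pairs(n, T, w))
```

Doubling the number of input views quadruples the full-attention cost but only doubles the local-attention cost, which is the linear scaling referred to above.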



How it works


LVT takes input view sequences and the corresponding local ray maps, and patchifies them to obtain input tokens. Within each LVT block, the tokens from a window of w neighbor views (here w=3 for illustration) relative to a query view are consolidated and updated with the corresponding relative transformation embeddings. The tokens from the query view then selectively attend to these neighbor tokens (instead of attending to all input tokens, as full self-attention does). This LVT block is repeated 24 times. The processed tokens are then unpatchified and decoded to pixel-aligned Gaussian splat parameters. The combined Gaussian splats are rendered using the 3DGS renderer and compared to ground truth during training.
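The sketch below illustrates one such local-view block in PyTorch-like pseudocode. It is not the authors' implementation: the clamped neighbor-window selection, the MLP that embeds a flattened 4x4 relative transform, and all layer sizes are our assumptions for illustration. In the paper, 24 such blocks are stacked before the tokens are unpatchified and decoded to pixel-aligned Gaussian parameters.

```python
# Minimal sketch of one LVT-style block (assumptions throughout, not the paper's code).
import torch
import torch.nn as nn

class LocalViewBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12, window: int = 3):
        super().__init__()
        self.window = window
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Hypothetical embedding of the relative transform (flattened 4x4
        # query->neighbor matrix), added to the neighbor-view tokens.
        self.rel_embed = nn.Sequential(nn.Linear(16, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens: torch.Tensor, rel_poses: torch.Tensor) -> torch.Tensor:
        # tokens:    (N, T, D)  - T patch tokens for each of N input views
        # rel_poses: (N, N, 16) - flattened relative transform between view pairs
        N, T, D = tokens.shape
        out = []
        for q in range(N):
            # Indices of the w neighbor views around the query view (clamped at the ends).
            lo = max(0, min(q - self.window // 2, N - self.window))
            neighbors = range(lo, lo + self.window)
            # Condition each neighbor's tokens on its transform relative to the query view.
            kv = torch.cat(
                [tokens[n] + self.rel_embed(rel_poses[q, n]) for n in neighbors], dim=0
            ).unsqueeze(0)                                   # (1, w*T, D)
            q_tok = tokens[q].unsqueeze(0)                   # (1, T, D)
            attn_out, _ = self.attn(self.norm_q(q_tok), self.norm_kv(kv), self.norm_kv(kv))
            x = q_tok + attn_out
            out.append(x + self.mlp(self.norm_mlp(x)))
        return torch.cat(out, dim=0)                         # (N, T, D)
```

Because each query view only ever attends to w neighbor views, the per-block cost grows linearly with the number of input views, while stacking blocks lets information propagate beyond the immediate window.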




Video Result Comparisons

[Video result comparisons featuring LVT_SH on five example scenes.]

Image Result Comparisons

We train and evaluate our methods, LVT_SH-rgba and LVT_base, on RealEstate10K and on DL3DV, a large-scale dataset of 4K-resolution videos captured from bounded and unbounded real-world scenes for benchmarking novel view synthesis. LVT surpasses prior methods to achieve state-of-the-art novel view synthesis on large scenes.
We additionally perform zero-shot evaluation on Tanks&Temples (the Train and Truck scenes) and Mip-NeRF360 using LVT trained on the DL3DV dataset. Despite not being trained on these datasets, LVT_SH-rgba surpasses 3DGS in LPIPS, PSNR, and SSIM on Tanks&Temples, and performs competitively with 3DGS on Mip-NeRF360.


View result comparison on DL3DV.

View result comparison on RealEstate10K.

View result comparison on Tanks&Temples.

View result comparison on Mip-NeRF360.

Ablation Study

We ablate our design choices and show corresponding video/image comparisons.

Impact of view-dependent opacity: Adding spherical harmonics to the opacity (in addition to the color) significantly improves the modeling of thin structures and reflective surfaces. A rough illustration of how such an evaluation could look is sketched below.
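The snippet is an assumed formulation, not the paper's exact one: it evaluates a degree-1 spherical-harmonics expansion in the viewing direction (using the same basis ordering as the 3DGS color evaluation) and squashes the result to a valid opacity with a sigmoid; the truncation to degree 1 and the sigmoid are our assumptions.

```python
# Illustrative view-dependent opacity via low-order spherical harmonics.
import numpy as np

SH_C0 = 0.28209479177387814   # Y_0^0
SH_C1 = 0.4886025119029199    # shared constant for Y_1^{-1}, Y_1^0, Y_1^1

def view_dependent_opacity(sh_coeffs: np.ndarray, view_dir: np.ndarray) -> np.ndarray:
    """sh_coeffs: (G, 4) per-Gaussian opacity SH coefficients (degrees 0 and 1).
    view_dir: (3,) direction from the Gaussian toward the camera."""
    x, y, z = view_dir / np.linalg.norm(view_dir)
    basis = np.array([SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x])  # 3DGS-style ordering
    raw = sh_coeffs @ basis                                       # (G,)
    return 1.0 / (1.0 + np.exp(-raw))                             # squash to (0, 1)
```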

[Paired video comparisons: with view-dependent opacity and color vs. with only view-dependent color.]

Impact of mixed-resolution training: Our model trained with the mixed-resolution training strategy surpasses the model trained with a single input resolution. Training the mixed-resolution model on longer sequences at lower resolutions enables it to develop a more comprehensive understanding of scene structure, which it can then transfer to higher resolutions. A sketch of such a sampling schedule follows.
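One way a mixed-resolution schedule might look is sketched below; the specific resolution/sequence-length pairs and the `load_views`/`train_step` helpers are hypothetical and not taken from the paper.

```python
# Hypothetical mixed-resolution sampling: lower resolutions are paired with
# longer view sequences so each batch covers more of the scene at a similar
# token budget.
import random

RESOLUTION_SCHEDULE = [
    {"resolution": (256, 256), "num_views": 64},  # low-res, long sequence
    {"resolution": (512, 512), "num_views": 16},  # high-res, short sequence
]

def sample_batch_config() -> dict:
    return random.choice(RESOLUTION_SCHEDULE)

# for step in range(num_steps):
#     cfg = sample_batch_config()
#     batch = load_views(cfg["num_views"], cfg["resolution"])  # hypothetical loader
#     loss = train_step(model, batch)                          # hypothetical step
```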

Impact of mixed-resolution training

For more ablations, refer to the paper.


Acknowledgments

The authors would like to thank Stephen Lombardi, Clement Godard, Tiancheng Sun and Yiming Wang for their invaluable feedback and discussions throughout the course of this research. We are also thankful to Peter Hedman, Daniel Duckworth, Ryan Overbeck, and Jason Lawrence for their valuable feedback and guidance in preparing this manuscript. Tooba Imtiaz was partly supported by an NIH Graduate Research Fellowship under Grant No. 5U24CA264369-03. The website template is adapted from Quark.


BibTeX

Coming soon.