AlignGS: Aligning Geometry and Semantics for Robust Indoor Reconstruction from Sparse Views

VCIP 2025

Yijie Gao*, Houqiang Zhong*, Tianchi Zhu, Zhengxue Cheng, Qiang Hu, Li Song

* co-first authors    corresponding authors

Abstract


The demand for semantically rich 3D models of indoor scenes is rapidly growing, driven by applications in augmented reality, virtual reality, and robotics. However, creating them from sparse views remains a challenge due to geometric ambiguity. Existing methods often treat semantics as a passive feature painted on an already-formed, and potentially flawed, geometry. We posit that for robust sparse-view reconstruction, semantic understanding should instead be an active, guiding force. This paper introduces AlignGS, a novel framework that actualizes this vision by pioneering a synergistic, end-to-end optimization of geometry and semantics. Our method distills rich priors from 2D foundation models and uses them to directly regularize the 3D representation through a set of novel semantic-to-geometry guidance mechanisms, including depth consistency and multi-faceted normal regularization. Extensive evaluations on standard benchmarks demonstrate that our approach achieves state-of-the-art results in novel view synthesis and produces reconstructions with superior geometric accuracy. The results validate that leveraging semantic priors as a geometric regularizer leads to more coherent and complete 3D models from limited input views.

Method Overview


We present an overview of the AlignGS pipeline, starting with initialization followed by the joint optimization of geometry and semantics. This framework enables end-to-end joint optimization of all geometric and semantic attributes of the Gaussian primitives, ensuring a synergistic refinement of both the scene's geometric structure and its semantic understanding.
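For illustration, the sketch below shows one plausible form of such a joint objective combining a photometric term, semantic distillation from a 2D foundation model, and the semantic-to-geometry regularizers named in the abstract (depth consistency, normal regularization). The specific loss forms, weights, and inputs (teacher_feat, prior_depth, prior_normal) are our assumptions for exposition, not the paper's exact formulation.

```python
import torch.nn.functional as F

def joint_loss(rendered_rgb, gt_rgb,
               rendered_feat, teacher_feat,
               rendered_depth, prior_depth,
               rendered_normal, prior_normal,
               w_sem=0.1, w_depth=0.05, w_normal=0.05):
    """Hypothetical combined objective for joint geometry-semantics optimization."""
    # Photometric reconstruction loss on the rendered image (C, H, W)
    l_rgb = F.l1_loss(rendered_rgb, gt_rgb)
    # Distill 2D foundation-model features into the rendered semantic field (C, H, W)
    l_sem = 1.0 - F.cosine_similarity(rendered_feat, teacher_feat, dim=0).mean()
    # Depth consistency against a (possibly monocular) depth prior (H, W)
    l_depth = F.l1_loss(rendered_depth, prior_depth)
    # Normal regularization against prior normals (3, H, W), via cosine alignment
    l_normal = (1.0 - (rendered_normal * prior_normal).sum(dim=0)).mean()
    return l_rgb + w_sem * l_sem + w_depth * l_depth + w_normal * l_normal
```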

Quantitative Comparison


We present our quantitative comparison across ScanNet and NRGBD scenes. As shown in the table above, AlignGS demonstrates state-of-the-art performance in novel view synthesis. As shown in the table below, AlignGS produces substantially more accurate and complete meshes, consistently achieving the highest F-score.

Qualitative Comparison


We present a qualitative novel view synthesis comparison across ScanNet and NRGBD scenes, showing our rendered RGB and depth from novel viewpoints. Our method produces significantly fewer artifacts and exhibits more coherent geometric structures than competing methods.

We present a qualitative mesh comparison across ScanNet and NRGBD scenes, showing our extracted mesh normals alongside key baselines and the ground truth. Our method recovers structurally more accurate and higher-fidelity surfaces, with improved smoothness on objects and sharper distinctions at semantic boundaries.

We present qualitative downstream editing results: the left two columns compare our segmentation from a novel viewpoint with Feature 3DGS; the right two columns demonstrate language-guided editing, including object extraction, deletion, and color highlighting (e.g., for pillows and cushions).
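As a rough sketch of how such language-guided editing can be driven by the distilled semantic field, the snippet below selects Gaussian primitives whose per-primitive features match a text query embedding; the function name, threshold, and feature shapes are hypothetical and assume a CLIP-style text encoder, not the paper's exact editing procedure.

```python
import torch
import torch.nn.functional as F

def select_gaussians_by_text(gaussian_feats: torch.Tensor,
                             text_embedding: torch.Tensor,
                             threshold: float = 0.6) -> torch.Tensor:
    """Hypothetical language-guided selection of Gaussian primitives.

    gaussian_feats: (N, C) distilled per-Gaussian semantic features.
    text_embedding: (C,) embedding of the query, e.g. "pillow".
    Returns a boolean mask (N,) usable for extraction, deletion, or recoloring.
    """
    # Cosine similarity between each primitive's feature and the text query
    sims = F.cosine_similarity(gaussian_feats, text_embedding.unsqueeze(0), dim=-1)
    return sims > threshold
```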

BibTeX


@misc{Gao_2025_VCIP,
  title         = {AlignGS: Aligning Geometry and Semantics for Robust Indoor Reconstruction from Sparse Views},
  author        = {Yijie Gao and Houqiang Zhong and Tianchi Zhu and Zhengxue Cheng and Qiang Hu and Li Song},
  year          = {2025},
  eprint        = {2510.07839},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {http://arxiv.org/abs/2510.07839}
}