Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation

ICRA 2024 Oral

1University of Arkansas, 2West Virginia University, 3University of Liverpool

Open-Fusion builds an open-vocabulary, queryable 3D scene representation in real time, bridging 3D perception and language models for robotics.

Abstract

Precise 3D environmental mapping is pivotal in robotics. Existing methods often rely on predefined concepts during training or are time-intensive when generating semantic maps. This paper presents Open-Fusion, a groundbreaking approach for real-time open-vocabulary 3D mapping and queryable scene representation using RGB-D data. Open-Fusion harnesses the power of a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension and employs the Truncated Signed Distance Function (TSDF) for swift 3D scene reconstruction. By leveraging the VLFM, we extract region-based embeddings and their associated confidence maps. These are then integrated with 3D knowledge from TSDF using an enhanced Hungarian-based feature-matching mechanism. Notably, Open-Fusion delivers outstanding annotation-free 3D segmentation for open-vocabulary without necessitating additional 3D training. Benchmark tests on the ScanNet dataset against leading zero-shot methods highlight Open-Fusion's superiority. Furthermore, it seamlessly combines the strengths of region-based VLFM and TSDF, facilitating real-time 3D scene comprehension that includes object concepts and open-world semantics.
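The "swift 3D scene reconstruction" the abstract refers to rests on the standard TSDF weighted running-average update used in volumetric fusion. A minimal sketch of that update rule (plain NumPy, with a single voxel reduced to a scalar; the function name `tsdf_update` is illustrative, not the paper's API):

```python
import numpy as np

def tsdf_update(tsdf, weight, new_dist, new_weight=1.0, trunc=0.05):
    """Fuse one truncated signed-distance observation into a voxel
    using the standard weighted running average."""
    d = np.clip(new_dist, -trunc, trunc)       # truncate the distance
    fused = (weight * tsdf + new_weight * d) / (weight + new_weight)
    return fused, weight + new_weight

# Toy example: fuse two depth observations of the same voxel.
tsdf, w = 0.0, 0.0
tsdf, w = tsdf_update(tsdf, w, 0.04)
tsdf, w = tsdf_update(tsdf, w, 0.02)
print(tsdf, w)  # -> 0.03 2.0 (average of the two clipped distances)
```

Open-Fusion extends this per-voxel fusion with region-based VLFM embeddings and confidence maps, so each fused segment also carries an open-vocabulary feature.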

Approach

Open-Fusion builds an open-vocabulary 3D queryable scene from a sequence of posed RGB-D images using two modules.

Real-time Semantic TSDF 3D Scene Reconstruction Module: takes a stream of RGB-D images and the corresponding camera poses, and incrementally reconstructs the 3D scene, representing it as a semantic TSDF volume.

Open-Vocabulary Query and Scene Understanding Module: accepts open-vocabulary queries as input and returns the corresponding scene segmentations, which can serve as the eyes for language-based robot commanding.
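The query module can be sketched as nearest-neighbor lookup in a shared vision-language embedding space: embed the query text, then label each fused scene region by cosine similarity against the text embeddings. A toy NumPy sketch, assuming region and text features already live in the same embedding space (all names here are illustrative, not the paper's API):

```python
import numpy as np

def query_scene(region_embeds, text_embeds, names):
    """Assign each scene region the open-vocabulary label whose text
    embedding has the highest cosine similarity with the region's
    fused vision-language feature."""
    r = region_embeds / np.linalg.norm(region_embeds, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = r @ t.T                      # (num_regions, num_prompts)
    best = sims.argmax(axis=1)          # best-matching prompt per region
    return [names[i] for i in best]

# Toy features: region 0 aligns with "chair", region 1 with "table".
regions = np.array([[1.0, 0.1], [0.1, 1.0]])
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
print(query_scene(regions, texts, ["chair", "table"]))  # -> ['chair', 'table']
```

In the actual system the region features come from the pre-trained VLFM and are associated with TSDF segments via the Hungarian-based matching described above, so the same similarity lookup runs over the 3D volume rather than 2D regions.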


ScanNet Dataset

Visualization of Scene

Object Query


Semantic Query

Using the 20 ScanNetV2 classes

Concurrent Works

NLMap proposes a natural-language queryable scene representation for task planning with LLMs.

VLMaps creates a map with visual-language features for open-vocabulary landmark indexing.

CLIP-Fields trains a scene-specific neural field that embeds vision-language features.

LERF trains a scene-specific neural field that encodes CLIP and DINO features for language-based concept retrieval.

ConceptFusion builds open-set 3D maps that can be queried via text, click, image, or audio offline.

BibTeX

@article{kashu2023openfusion,
  title={Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation},
  author={Yamazaki, Kashu and Hanyu, Taisei and Vo, Khoa and Pham, Thang and Tran, Minh and Doretto, Gianfranco and Nguyen, Anh and Le, Ngan},
  journal={arXiv preprint arXiv:2310.03923},
  year={2023}
}