Contrastive Gaussian Clustering for Weakly Supervised 3D Scene Segmentation

Castillo, Myrna; Dahaghin, Mahtab; Toso, Matteo; Del Bue, Alessio

doi:10.1007/978-3-031-78347-0_8

Myrna Castillo¹³,
Mahtab Dahaghin¹³,
Matteo Toso¹³ &
…
Alessio Del Bue¹³

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15323))

Included in the following conference series:

International Conference on Pattern Recognition

356 Accesses

Abstract

3D scene segmentation is a crucial task in Computer Vision, with applications in autonomous driving, augmented reality, and robotics. Traditional methods often struggle to provide consistent and accurate segmentation across different viewpoints. To address this, we look at the growing field of novel view synthesis. Methods like NeRF and 3DGS take a set of images and implicitly learn a multi-view consistent representation of the geometry of the scene; the same strategy can be extended to learn a 3D segmentation of the scene that is consistent with the 2D segmentation of an initial training set of input images.

We introduce Contrastive Gaussian Clustering, a novel approach for novel segmentation view synthesis and 3D scene segmentation. We extend 3D Gaussian Splatting to include a learnable 3D feature field, which allows us to cluster the 3D Gaussians into objects. Using a combination of contrastive learning and spatial regularization, our model can be trained on inconsistent 2D segmentation labels, and still learn to generate multi-view consistent masks. Moreover, the resulting model is extremely accurate, improving the IoU accuracy of the predicted masks by $+8\%$ over the state of the art.

Code and trained models are available at https://github.com/MyrnaCCS/contrastive-gaussian-clustering.

M. Castillo and M. Dahaghin—These authors contributed equally to this work.

You have full access to this open access chapter, Download conference paper PDF

View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields

Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation Without Manual Labels

3DMV: Joint 3D-Multi-view Prediction for 3D Semantic Scene Segmentation

Keywords

1 Introduction

Reliable and efficient 3D scene segmentation, i.e., the ability to divide the content of a 3D scene into different objects, is a fundamental skill at the core of several computer vision tasks, and it is a prerequisite for autonomous navigation, scene understanding, and for many AR/VR applications [13]. In this work, we propose a general 3D scene segmentation approach based on 3D Gaussian Splatting [16], that only requires 2D images and their segmentation masks as input, without making assumptions on the masks’ consistency across images.

One of the challenges of 3D scene segmentation is the limited availability of annotated 3D scene datasets, as manual annotations are time-consuming [15]. Recent works bypassed this issue by lifting readily available 2D image understanding to 3D space [28], inserting their semantic information into 3D point clouds [14, 29, 30] or NeRFs [17, 19, 40]. These methods have shown that averaging noisy labels across multiple views generates view-independent dense semantic labels [40]. Early approaches relied on a limited range of task-specific labels [7, 36], but the recent introduction of foundational models like CLIP [32] and SAM [18] provide open-vocabulary 2D semantic segmentation labels, which can be used to optimize scene representations [31, 37]. The segmentation masks generated by the foundation models, however, are not always consistent across views, and existing methods require time-consuming pre-processing to enforce cross-view consistency in the training data [37]. In this work, we address this by introducing a model that can be trained on inconsistent 2D segmentation masks, while still learning a 3D feature field consistent across all views.

As exemplified in Fig. 1, our method takes as input a) a set of multi-view images and b) their 2D segmentations, which are not required to be consistent across views. We then use images and masks to train c) a model representing both the visual and geometrical information of the scene, as well as a 3D segmentation feature field. This model can then be used for a wide range of d) downstream tasks exploiting the visual information (novel view synthesis), segmentation information (3D scene segmentation) or on a combination of the two (returning a segmentation mask given a selected point on a rendered view). The optimization of the geometric and visual components can be approached following standard 3D Gaussian Splatting [16], using a rendering loss to optimize the color, position and shape of the 3D Gaussians. To learn the 3D segmentation feature field, we propose extracting information from the inconsistent 2D segmentation masks via contrastive learning. This approach ensures segmentation consistency across all views without requiring changes to the 2D masks themselves. We test the proposed method against related works based on implicit scene representations [17] and 3D Gaussian representations [31, 37], and show through qualitative and quantitative evaluation how our method can match and outperform them. A video outlining the motivation and main results of this paper is available at the project’s page. Our contributions can be summarized as:

A novel approach to embed a 3D feature field in a 3DGS model, enabling simultaneous modelling of the scene’s appearance and segmentation.
A contrastive-learning approach enforcing a multi-view consistent segmentation feature field, even when training on inconsistent segmentation masks.
An approach for 3D scene segmentation by clustering the Gaussians according to the feature field.

2 Related Work

In this section, we provide an overview of the relevant literature on image and scene segmentation, in addition to 3D scene modeling with techniques for novel-view synthesis. For a complete review of scene understanding or semantic segmentation, we refer the reader to [27] and [11], respectively.

Scene Understanding. Scene understanding is a fundamental problem in computer vision, inferring the semantics and properties of all elements in a 3D scene given a 3D model and a set of RGB images [28]. Early approaches train models on ground-truth (GT) 3D labels, focusing on specific tasks like 3D object classification [36], object detection and localization [7] or 3D semantic and instance segmentation [2, 6, 9, 21]. To overcome the limited availability of 3D GT data, subsequent work leverages 2D supervision, by back-projecting and fusing 2D labels to generate pseudo 3D annotations [12] or applying contrastive learning between 2D and 3D features [24, 33]. More recently, large visual language models [4, 18, 32] have allowed to shift from a close-set of predefined labels to an open-vocabulary framework [32], making it possible zero-shot transfer to new tasks and dataset distributions. We also leverage contrastive learning and foundation models, using class-agnostic segmentation masks generated by the Segment Anything Model (SAM) [18]. However, we apply such techniques to a different scene representation - 3D Gaussian Splatting - and combine contrastive loss with other forms of supervision, like spatial regularization from the distance between the Gaussians.

Radiance Fields. Neural Radiance Fields (NeRF) [25] optimize a Multilayer Perceptron (MLP) to represent a 3D scene as a continuous volumetric function that maps position and viewing direction to density and color. NeRF has enabled the rendering of complex scenes, producing high-quality results in novel-view synthesis. Subsequent work have focused on faster training/rendering [1, 26, 39]. An alternative approach to NVS comes from 3D Gaussian Splatting (3DGS) [16], which achieves both competitive training times and real-time rendering at higher image resolution. Unlike NeRF [25], 3DGS foregoes a continuous volumetric representation and instead approximate a scene using millions of 3D Gaussians with different sizes, orientations, and view-dependent colors. One of the advantages of this approach is that it allows for direct access to the radiance field data, enabling to edit the scene by removing, displacing or adding Gaussians [8]. This also allows capturing dynamic scenes, including a time parameter to model the scene’s changes over time [35]; or combining the model in a pipeline with a foundation model to edit the scene from text prompts [10] or select the Gaussians associated with a specific object [37]. Of these methods, Gaussian Grouping is the closest to our application by segmenting the scene into groups of 3D Gaussians. However, this technique relies on a video-tracker to obtain consistent masks IDs across the training images, which also preset the number of instances in the scene.

Scene Understanding in Radiance Fields Representations. Semantic-NeRF [40] extends the implicit scene representation to encode appearance, geometry, and semantics, and generates denoised semantic labels by training over sparse or noisy annotations. Other methods propose to distill image embeddings extracted by a foundation model encoder into a 3D feature field. Distilled Feature Fields (DFF) [34] includes an extra branch that outputs a pixel-aligned feature vector extracted from LSeg [20] or DINO [4]. Unlike DFF, LERF [17] supervises by rendering non pixel-aligned multi-scale CLIP [32] embeddings. Although these techniques locate a wide variety of objects given any language prompt, they may suffer from inaccurate segmentations occasionally caused by objects with similar semantics. More recent methods [5, 31] have used foundational models for grounding language/segmentation features onto the 3D Gaussians. While these methods provide better performance in localization tasks, achieving higher accuracy, their segmentation masks are noisy/patchy. Mingqiao et al. [37] cluster the Gaussians by assigning them a unique identity ID. Though these methods can include instance segmentation features into the scene representation, the number of objects in the scene is predefined, and it requires an additional tracking method to pre-compute the needed multi-view consistent segmentation labels. A similar approach of using contrastive learning to lift inconsistent 2D segmentations into NeRF has also been used in 3D instance segmentation [3]. We show an alternative method to encode identity features into 3D Gaussians, so we can group them into clusters that we can easily extract/remove from the 3D scene.

3 Methodology

In this work, we represent a scene as a collection of 3D Gaussians that jointly model geometry, appearance, and instance segmentation information. Our approach allows high-quality real-time novel view and segmentation synthesis. We empower a 3DGS model to tackle scene understanding downstream tasks by augmenting each 3D Gaussian with a view-independent feature vector. This set of learnable feature vectors is called the 3D feature field. We optimize our 3D feature field to lift inconsistent 2D segmentation masks into 3D space. A post-optimization process is then applied to render multi-view consistent segmentations and to segment the scene into distinct clusters. A comparison between our algorithm and 3DGS is available in the Supplementary Material.

As shown in Fig. 2, our approach takes (a) a set of input images, from which we independently extract (b) inconsistent 2D segmentation masks using a foundation model for image segmentation. Then, we optimize the 3D Gaussians using (c) the original 3DGS loss function [16] that measures the difference between the rendered and ground truth images. Simultaneously, we make use of (e) a contrastive clustering loss to supervise the 3D feature field. This results in (d) a 3D Gaussian scene representation which captures both visual and instance information. To provide more accurate segmentations and speed up training, we introduce (f) a regularization term that enforces the correlation between the distance of Gaussians in Euclidean and the feature space.

In this section, we first review the 3DGS rendering method. Then we discuss the main steps of our pipeline, including rendering and supervising the 3D feature field via contrastive learning.

3.1 Preliminaries on 3D Gaussian Splatting

The 3DGS model represents a scene as millions of 3D Gaussians parameterized by their position $\mu $, 3D covariance matrix $\varSigma $, opacity $\alpha $, and color c. 3DGS represents the view-dependent appearance c by spherical harmonics. These parameters are jointly optimized to render high-quality novel-views. Since 3DGS preserves the properties of differentiable volumetric representations, it requires as input only a set of images and their camera parameters. Initially, 3DGS creates a set of 3D Gaussians using a sparse Structure-from-Motion (SfM) point cloud obtained during camera calibration. To render 3D Gaussians from a particular point of view, the 3DGS starts by projecting these Gaussians onto the image space, a rendering termed “splatting”. Subsequently, 3DGS generates a sorted list of $\mathcal {N}$ Gaussians, ordering them from closest to farthest. The color of a pixel C is then computed by $\alpha $-blending the colors of $\mathcal {N}$ overlapping points:

$$\begin{aligned} C = \sum _{i \in \mathcal {N}} c_i\alpha '_i\prod _{j=1}^{i-1}(1 - \alpha '_{j}). \end{aligned}$$

(1)

The final opacity $\alpha '_i$ is determined by multiplying the learned opacity $\alpha _i$ and the 2D Gaussian. The optimization is done by subsequent iterations that compare the ground-truth images against the corresponding rendered views.

3.2 3D Feature Field

The 3D feature field is a collection of learnable-vectors stored on the 3D Gaussians, that encode the instance segmentation of the scene. We augment each 3D Gaussian with a learnable feature f. Unlike the view-dependent appearance, this feature must remain consistent across all viewing directions. Therefore, instead of computing spherical harmonics coefficients, we extract its component from Gaussians. During training, we randomly initialize the feature vectors and then adjust them to minimize the contrastive clustering error. The optimization of the 3D feature field involves three iterative steps repeated for each training view: 3D feature field rendering; clustering the rendered features following the related GT segmentation map; and back-propagating the contrastive clustering error.

At each iteration, we render an image and its corresponding 2D feature map, following an analogous process to the rendering algorithm described in Sect. 3.1. For each pixel of the desired view, we $\alpha $ blend the features as:

$$\begin{aligned} F = \sum _{i \in \mathcal {N}} f_i\alpha '_i\prod _{j=1}^{i-1}(1 - \alpha '_{j}) \end{aligned}$$

(2)

Contrastive Clustering. As the first step toward scene optimization, SAM automatically generates 2D segmentation masks from the set of input images. Specifically, we deploy SAM’s automatic generation pipeline on each training image $I \in \mathbb {R}^{H \times W}$, resulting in sets of segments $\{m^p \in \mathbb {R}^{H \times W} | p = 1 \dots \mathcal {N}_k\}$. The number of segments per image $\mathcal {N}_k$ is uncapped, allowing the proposed approach to learn as many instances as are present in the scene. As described previously, the 3D feature field optimization is composed of three steps: rendering a 2D feature map for a given view, clustering the rendered features based on the corresponding 2D segmentation masks to compute a contrastive clustering loss, and then updating our 3D feature field accordingly.

Contrastive clustering maximizes the similarity among features within the same segment in the segmentation map, while minimizing it for those from different segments. Given a view p, the cluster $\{f^p\}$ is the set of rendered features of the 2D feature map that belongs to the same segment $m^p$ in the corresponding GT segmentation map, and the mean feature in $\{f^p\}$ is the centroid $\bar{f}^p$. Like Contrastive Lift [3], we adopt a slow-fast contrastive learning strategy where the teacher parameters $\bar{f}^p$ are updated by exponential moving average of the student parameters $\{f^p\}$. Our objective is to minimize the following loss function:

$$\begin{aligned} \mathcal {L}_{CC} = - \frac{1}{\mathcal {N}_k}\sum _{p=1}^{\mathcal {N}_k}\sum _{q=1}^{|\{f^p\}|} \log {\frac{\exp {\left( f_q^p \cdot \bar{f}^p {/} \phi ^p \right) }}{\sum _{s=1}^{\mathcal {N}_k} \exp {\left( f_q^p \cdot \bar{f}^s {/} \phi ^s \right) }}}, \end{aligned}$$

(3)

where, $f_q^p$ are features in $\{f^p\}$. The concentration estimation of the p-th cluster is $\phi ^p$. Similar to [38], we define it as: $\sum _{q=1}^{\mathcal {N}_p} \Vert f_q^p - \bar{f}^p \Vert _2 {/} \mathcal {N}_p\log \left( \mathcal {N}_p+\epsilon \right) $, where $\mathcal {N}_p = |\{f^p\}|$ and $\epsilon =100$. $\phi $ is used to balance the cluster size and variance, and is small if the number of pixel-feature elements is high and the average distance between its elements and the centroid is small. The smooth parameter $\epsilon $ is needed to avoid excessively large $\phi $. Rather than regularize the features by including a normalization loss, we apply $\ell _2$-normalization to each feature in the rendered feature map before the loss computation.

Spatial-Similarity Regularization. An easy way to obtain 3D instance segmentation is to cluster similar features. However, we occasionally observe sparse outliers (Gaussians misclassified) in regions where the scene is not well observed. Furthermore, we notice that constant failures in the 2D segmentation (e.g., a chair that is inconsistently segmented in two parts: legs and seat) may induce to inaccurate segmentation masks.

To address these issues, we include spatial-similarity regularization to enforce spatial continuity of the feature vectors, encouraging adjacent 3D Gaussians to have similar segmentation feature vectors while discouraging faraway Gaussians from having the same segmentation features. The regularization function is computed with $\mathcal {M}$ sampling Gaussians:

$$\begin{aligned} \mathcal {L}_{regularization} = \frac{\lambda _{near}}{\mathcal {M}\mathcal {K}} \sum _{j=1}^{\mathcal {M}} \sum _{i=1}^{\mathcal {K}} H \left( 1 - f_j \cdot f_i \right) + \frac{\lambda _{far}}{\mathcal {M}\mathcal {L}} \sum _{j=1}^{\mathcal {M}} \sum _{i=1}^{\mathcal {L}} H \left( f_j \cdot f_i \right) , \end{aligned}$$

(4)

where H denotes the sigmoid function. We compute the cosine similarity of features for the closest $\mathcal {K}=2$ and the farthest $\mathcal {L}=5$ Gaussians. Empirically, we found $\lambda _{near} = 0.05$ and $\lambda _{far} = 0.15$ to yield the best result.

Loss Function. The losses defined in this section are combined in a total loss:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{rendering} + \lambda _{clustering}\mathcal {L}_{CC} + \mathcal {L}_{regularization}, \end{aligned}$$

(5)

where $\mathcal {L}_{rendering}$ is the original rendering loss of 3DGS. Empirically, we set $\lambda _{clustering}=1 \times 10^{-6}$. See Sect. 4.4 for ablation of these parameters.

4 Experiments

We aim to segment objects within a scene into distinct clusters, to generate novel segmentation masks from any viewpoint of the scene. We therefore compare our algorithm against recent work for scene understanding, which code has already been published. Specifically, we compare our approach against three relevant competitors: LERF [17], Gaussian Grouping [37], and LangSplat [31]. LERF is an open-vocabulary localization method that embeds a language field within a NeRF by grounding CLIP embeddings extracted at multiple scales over the training images. Given a text query, LERF predicts 3D regions with the semantic content pertinent to the input query. The recent Gaussian Grouping [37] is a technique for classifying 3D Gaussians into predefined instances, and LangSplat [31] is an approach that results in a collection of 3D Language Gaussians, such as LERF, outputs a relevancy map for a given text. We evaluate the performance using two metrics: the mean intersection over union (mIoU), which measures the overlap of the GT and rendered masks; and the mean boundary intersection over union (mBIoU), which evaluates contour alignment between predicted and ground truth masks. In both cases, we report the average performance over all test views and text prompts.

In this section, we first provide details about the datasets used to evaluate the models (Sect. 4.1), then give some implementation details (Sect. 4.2) and report the segmentation performance of the models (Sect. 4.3). Finally, we discuss the advantages of a spatial-similarity regularization loss (Sect. 4.4).

4.1 Datasets

We evaluate the chosen models on two datasets containing indoor and outdoor scenes: the LERF-Mask dataset [37] and the 3D-OVS dataset [22].

LERF-Mask. The LERF-Mask dataset is composed of three manually annotated scenes from the LERF-Localization dataset [17]. These scenes belong to the “posed long-tailed objects” of LERF-Localization, which are scenes containing multiple objects with low search volume and low competition, arranged on a plane, like a set of objects arranged on a small table (“Figurines”). These scenes are captured using the Polycam application on an iPhone, utilizing its onboard SLAM to obtain the camera poses.

3D-OVS. We also report quantitative and qualitative results on five scenes of the 3D-OVS dataset [22], which also consists of a set of long-tail objects, such as toys and everyday objects on a “Bed” or on a “Sofa”.

4.2 Implementation Details

The models evaluated in this section are supervised on segmentation masks automatically generated with the ViT-H SAM model, trained on the SA-1B dataset [18]. These masks are used to learn feature vectors in $\mathbb {R}^{16}$ for each Gaussian. To ensure a stable training process, the loss terms of Eq. (5) are applied with different frequencies: the standard 3DGS loss, used to optimize the geometrical and appearance aspects of the scene, is used at every training iteration. The contrastive clustering loss every 50 iterations, and the spatial-similarity regularization every 100 iterations. Moreover, to reduce the size of the problem and make the loss more stable, we evaluate the clustering loss only on clusters composed by more than 100 features. The optimization of a single scene takes approximately 30k iterations on an NVIDIA 4090 GPU, which amounts to approximately 20 min. The trained model then can render a novel segmentation mask in 0.005 seconds; comparing this with the time necessary to run ViT-H SAM on an image (5.1 sec), this highlights the advantage of the proposed method.

Instance Segmentation. After optimization, the model can be used for Object Selection, as exemplified in Fig. 1; given one calibrated image, we want to find the segmentation mask associated with a given selected pixel. Given a 2D pixel location in the image, we obtain a discriminative feature, i.e.the rendered feature vector at that pixel’s location. We then generate a 2D similarity map $S_C$ by rendering segmentation features for all pixels of the image, and evaluating their cosine similarity to the discriminative vector. Each pixel of the view $(u, v) \in I$ is then categorized as part of the object of interest or not. Pixels with cosine similarity greater than a fixed threshold t (empirically chosen as $t = 0.7$) are classified as part of the object; otherwise, they are not. The segmentation mask $M_{OBJ}$ is defined as:

$$\begin{aligned} M_{OBJ}(u, v) = {\left\{ \begin{array}{ll} \text{1, } & \text{ if } S_C(u, v) \ge t \\ \text{0, } & \text{ otherwise } \end{array}\right. } \end{aligned}$$

(6)

We note that this process can be applied in parallel to multiple objects, by extracting a set of discriminative features at different locations. An analogous approach also allows the 3D segmentation of the scene, by selecting one or more Gaussians and extracting, for each, all Gaussians with a high similarity score.

Semantic Segmentation. The proposed model renders novel feature maps by projecting and blending the content of the 3D feature field on an image plane. To compare these against the ground truth mask, we follow this procedure: i) we select a text prompt related to the content of the scene; ii) we feed into Grounding DINO [23] an (Image, Text) pair which provides a bounding box that we use to generate a segmentation mask by using it as a prompt to SAM; iii) we sample the rendered feature map associated to a pixel within the segment, and iv) use it as a discriminative feature, generating the object’s segmentation in an arbitrary view by selecting all pixels whose rendered-feature vector is falls within a predefined threshold from the discriminative feature.

4.3 Evaluation on Features

First, we compare the performance of our Contrastive Gaussian Clustering against its competitors. We report the average performance on each scene, but a complete breakdown of the performance on each object is available in the Supp.Mat.

Table 1. Comparison of semantic segmentation on LERF-Mask dataset. We report the mIoU and mBIoU (higher is better). LERF-Mask dataset contains accurate segmentation masks that we use to evaluate our segmentation performance.

Full size table

Table 2. Comparison of semantic segmentation on 3D-OVS dataset, on scenes with sparse long-tail objects and simple background. We report the mIoU (higher is better).

Full size table

As shown in Table 1, our method significantly outperforms the other approaches on both metrics, providing on average a $+43 \%$ accuracy than LERF, $+36 \%$ accuracy than LangSplat, and $+8 \%$ accuracy than Gaussian Grouping on average. Regarding the boundary quality of the masks, we outperform on average the competitors by $48 \%$, $37 \%$, and $9 \%$. Though Gaussian Grouping achieves better performance on Ramen, we suggest looking at Fig. 3, in which we show how our method produces better qualitative results, with more accurate segmentations.

When we test the models on the 3D-OVS, the performance is comparable with the previous experiments, as shown in Table 2. Here, our method outperforms the competitors only on two out of five scenes. This is due to two types of error in the training data masks: type I) incorrect object localization by Grounding DINO, and type II) incorrect object segmentation by SAM. For example, the average accuracy on Sofa is low because of two outlier objects: object Pikachu, with a completely incorrect segmentation (Type I error), and grey sofa, that in most training views is detected as two objects (Type II error). However, on average our model achieves the best performances; we outperform LERF on all scenes; and when we perform worse than Gaussian Grouping or LangSplat the performance gap on the mIoU is small: $2.1\%$ on Bed, 0.6 on Sofa, and 4.7 on Lawn.

Of the competitor models, Gaussian Grouping is the one that achieves the closest performance to us. The main limitation of this method is that, while it also enforces multi-view consistency, it does so through preprocessing, requiring that the 2D segmentation masks are made consistent. However, errors in this process propagate to the model, resulting in worse performance. In contrast, our model is not affected by this problem, as it autonomously learns to enforce consistency across the various views. The limited performance of LangSplat is instead due to its embedding in the image semantic features, embedded as 3-dimensional vector, without having a mechanism to ensure no two segments have similar features; this results in noisy segmentation masks and misdetections. This does not happen in our method, since the contrastive clustering loss ensures features from different segments are far in feature space.

Finally, Fig. 3 provides a qualitative comparison of the methods. We can see that the resulting segmentation masks are compatible with the numerical results, showing how our method produces qualitatively better instance segmentations than our competitors. Additional results showing the qualitative performance on 3D segmentation are available in the Supplementary Material.

4.4 Ablation Studies

In the previous experiments, we have claimed that the advantage of our method and, to a lesser extent, of Gaussian Grouping on the other methods is due to the implicitly learned multi-view consistency; which, in our case, is enforced through the loss of Eq. (5). To validate this assumption, we run an ablation test comparing the performance of our model with and without spatial-similarity regularization. The results, reported in Table 3, show that in most scenes the spatial-similarity loss results in a significant performance improvement, on average of $78.8 \%$ against $80.3\%$. This is also supported by the qualitative results on 3D segmentation reported in Fig. 4.

Table 3. An ablation study of our model. In this experiment, we explore the impact of the spatial-similarity regularization loss on the segmentation quality. Metrics are averaged over all the test views.

Full size table

Finally, we validate the choice of hyperparameters by studying their effect on segmentation accuracy. Figure 5 shows how setting the instance segmentation threshold to $t = 0.7$ maximizes performance on all scenes. In Table 4 we instead report the average performance when perturbing each of the hyperparameters of the loss function defined in Eq. (5).

Table 4. Segmentation accuracy as a function of the hyperparameters of Eq. 5.

Full size table

5 Conclusions

In this paper, we introduce Contrastive Gaussian Clustering, a novel approach for 3D scene segmentation. We have shown how, by implicitly enforcing a contrastive clustering loss, we are able to learn consistent segmentation features from an inconsistent set of 2D segmentation masks. This means that the proposed model can learn from automatically generated segmentation masks, with little to none preprocessing required. Moreover, the use of a spatial-similarity regularization ensures that the features learned for Gaussians corresponding to different 3D clusters are distinct enough to provide accurate 3D segmentation. The combination of such two losses results in an efficient and accurate model that outperforms current approaches based both on NERF and 3DGS.

Limitations. Although the results reported in the paper are very promising, including additional information involves some trade-off. Foremost, the use of the two additional losses involves a computational overhead with respect to standard 3DGS, requiring on average $100 \%$ longer time to train. We can, however, reduce this by only applying the losses every 50/100 iterations, respectively. Moreover, the additional information stored in the Gaussians requires larger memory capacity; future works will consider more efficient ways of including the identity information into the scene representation. Other limitations are inherited from SAM and Grounding DINO. For example, to select all Gaussians matched to a given semantic label, we rely on Grounding DINO to select that object’ location in a reference image. However, if this location is wrong, it will not be possible to recover the correct mask. The model’s performance is also limited by the accuracy of the 2D segmentations used in training. We observe that, if multiple views contain incorrect masks, this can result into multiple instances being clustered together.

Future Works. We will expand the proposed approach, integrating it with LLM for language interaction, and extending the feature field to also include hierarchical segmentations. Future work will explore more advanced ways of contrastive clustering. Concerning our multi-view contrastive loss, in future work we could explore more intelligent ways to contrast all the feature-objects.

References

Barron, J.T., et al.: Mip-NeRF 360: unbounded anti-aliased neural radiance fields. In: CVPR (2022)
Google Scholar
Behley, J., et al.: SemanticKITTI: a dataset for semantic scene understanding of lidar sequences. ICCV (2019)
Google Scholar
Bhalgat, Y., Laina, I., Henriques, J.F., Zisserman, A., Vedaldi, A.: Contrastive lift: 3D object instance segmentation by slow-fast contrastive fusion. In: NeurIPS (2023)
Google Scholar
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
Google Scholar
Cen, J., et al.: Segment any 3D Gaussians. arXiv preprint arXiv:2312.00860 (2023)
Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. 3DV (2017)
Google Scholar
Chen, D.Z., Chang, A.X., Nießner, M.: Scanrefer: 3D object localization in RGB-D scans using natural language. In: ECCV (2020)
Google Scholar
Chen, G., Wang, W.: A Survey on 3D Gaussian Splatting (2024)
Google Scholar
Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In: CVPR (2017)
Google Scholar
Fang, J., Wang, J., Zhang, X., Xie, L., Tian, Q.: GaussianEditor: editing 3D Gaussians delicately with text instructions. arXiv preprint arXiv:2311.16037 (2023)
Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Garcia-Rodriguez, J.: A review on deep learning techniques applied to semantic segmentation (2017)
Google Scholar
Genova, K., et al.: Learning 3D semantic segmentation with only 2D image supervision. 3DV (2021)
Google Scholar
Hou, J., Dai, A., Nießner, M.: 3D-SIS: 3D semantic instance segmentation of RGB-D scans. In: CVPR (2019)
Google Scholar
Hu, Q., et al.: Randla-net: efficient semantic segmentation of large-scale point clouds. In: CVPR (2020)
Google Scholar
Hua, B.S., Pham, Q.H., Nguyen, D.T., Tran, M.K., Yu, L.F., Yeung, S.K.: SceneNN: a scene meshes dataset with annotations. In: 3DV (2016)
Google Scholar
Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. (2023)
Google Scholar
Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: LERF: language embedded radiance fields. In: ICCV (2023)
Google Scholar
Kirillov, A., et al.: Segment anything. arXiv:2304.02643 (2023)
Kundu, A., et al.: Panoptic neural fields: a semantic object-aware neural scene representation. In: CVPR (2022)
Google Scholar
Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. In: ICLR (2022)
Google Scholar
Liao, Y., Xie, J., Geiger, A.: KITTI-360: a novel dataset and benchmarks for urban scene understanding in 2D and 3D. TPAMI (2023)
Google Scholar
Liu, K., et al.: Weakly supervised 3D open-vocabulary segmentation. In: NeurIPS (2023)
Google Scholar
Liu, S., et al.: Grounding DINO: marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023)
Liu, Y., Fan, Q., Zhang, S., Dong, H., Funkhouser, T.A., Yi, L.: Contrastive multimodal fusion with tupleinfonce. ICCV (2021)
Google Scholar
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)
Google Scholar
Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. (2022)
Google Scholar
Naseer, M., Khan, S., Porikli, F.: Indoor scene understanding in 2.5/3D for autonomous agents: a survey. IEEE Access (2019)
Google Scholar
Peng, S., Genova, K., Jiang, C.M., Tagliasacchi, A., Pollefeys, M., Funkhouser, T.: OpenScene: 3D scene understanding with open vocabularies (2023)
Google Scholar
Qi, C.R., et al.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: NeurIPS (2017)
Google Scholar
Qi, C., Su, H., Mo, K., Guibas, L.: Pointnet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660 (2017)
Google Scholar
Qin, M., Li, W., Zhou, J., Wang, H., Pfister, H.: LangSplat: 3D language Gaussian splatting (2023)
Google Scholar
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Google Scholar
Sautier, C., Puy, G., Gidaris, S., Boulch, A., Bursuc, A., Marlet, R.: Image-to-lidar self-supervised distillation for autonomous driving data. In: CVPR (2022)
Google Scholar
Kobayashi, S., Matsumoto, E., Sitzmann, V.: Decomposing NeRF for editing via feature field distillation. In: NeuIPS (2022)
Google Scholar
Wu, G., et al.: 4D Gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528 (2023)
Wu, Z., et al.: 3D ShapeNets: a deep representation for volumetric shapes. In: CVPR (2015)
Google Scholar
Ye, M., Danelljan, M., Yu, F., Ke, L.: Gaussian grouping: segment and edit anything in 3D scenes. arXiv preprint arXiv:2312.00732 (2023)
Ying, H., et al.: Omniseg3D: Omniversal 3D segmentation via hierarchical contrastive learning (2023)
Google Scholar
Yu, A., Fridovich-Keil, S., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: radiance fields without neural networks (2021)
Google Scholar
Zhi, S., Laidlow, T., Leutenegger, S., Davison, A.J.: In-place scene labelling and understanding with implicit scene representation. In: ICCV (2021)
Google Scholar

Download references

Acknowledgments

This project has received funding from the European Union’s Horizon research and innovation programme under grant agreement No 101079116 and No 101079995.

Author information

Authors and Affiliations

Pattern Analysis and Computer Vision (PAVIS), Istituto Italiano di Tecnologia (IIT), Genoa, Italy
Myrna Castillo, Mahtab Dahaghin, Matteo Toso & Alessio Del Bue

Authors

Myrna Castillo
View author publications
You can also search for this author in PubMed Google Scholar
Mahtab Dahaghin
View author publications
You can also search for this author in PubMed Google Scholar
Matteo Toso
View author publications
You can also search for this author in PubMed Google Scholar
Alessio Del Bue
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Myrna Castillo .

Editor information

Editors and Affiliations

University of Salford, Salford, Lancashire, UK
Apostolos Antonacopoulos
Indian Institute of Technology Bombay, Mumbai, Maharashtra, India
Subhasis Chaudhuri
Johns Hopkins University, Baltimore, MD, USA
Rama Chellappa
Chinese Academy of Sciences, Beijing, China
Cheng-Lin Liu
IIT Kharagpur, Kharagpur, West Bengal, India
Saumik Bhattacharya
Indian Statistical Institute Kolkata, Kolkata, West Bengal, India
Umapada Pal

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 14843 KB)

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Reprints and permissions

Copyright information

About this paper

Cite this paper

Castillo, M., Dahaghin, M., Toso, M., Del Bue, A. (2025). Contrastive Gaussian Clustering for Weakly Supervised 3D Scene Segmentation. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15323. Springer, Cham. https://doi.org/10.1007/978-3-031-78347-0_8

Download citation

DOI: https://doi.org/10.1007/978-3-031-78347-0_8
Published: 02 December 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78346-3
Online ISBN: 978-3-031-78347-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The International Association for Pattern Recognition (opens in a new tab)

Contrastive Gaussian Clustering for Weakly Supervised 3D Scene Segmentation

Abstract

Similar content being viewed by others

View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields

Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation Without Manual Labels

3DMV: Joint 3D-Multi-view Prediction for 3D Semantic Scene Segmentation

Keywords

1 Introduction

2 Related Work

3 Methodology