Sub-scene segmentation using constraints based on Gestalt principles

https://doi.org/10.1016/j.jvcir.2014.02.017

Highlights

  • Sub-scenes are more integrated and semantically consistent regions.

  • Proximity grouping is formulated more appropriately using influence areas.

  • Optimal segmentation is achieved by a self-determined optimal retrieval strategy.

  • The method ignores unimportant details and results in a more integrated segmentation.

Abstract

In this paper, an unsupervised sub-scene segmentation method is proposed. It emphasizes generating more integrated and semantically consistent regions instead of the homogeneous but over-segmented, overly detailed regions usually produced by conventional segmentation methods. Several properties of sub-scenes, such as proximity grouping, area of influence, similarity and harmony, are explored based on psychological principles. These properties are formulated into constraints that are used directly in the proposed sub-scene segmentation. A self-determined approach is used to obtain the optimal segmentation result based on the characteristics of each image in an unsupervised manner. The proposed method is evaluated on three datasets. In the quantitative evaluation, the performance of the proposed method is on par with state-of-the-art unsupervised segmentation methods; in the qualitative evaluation, the proposed method handles various sub-scenes well and produces neater results. The sub-scenes segmented by the proposed method are generally consistent with natural scene categories.

Introduction

Image segmentation aims to partition an image into non-overlapping homogeneous regions and is fundamental to many image processing and computer vision applications such as object and saliency detection [1], [2], semantic annotation [3], [4], event detection [5], and hierarchical scene understanding [6].

Despite years of research, image segmentation remains a very challenging problem because it is inherently ill-posed and ambiguous [7]. An image can be perceived and segmented in various ways because people have different preferences. Moreover, the “correct” segmentation may differ from one visual task to another. To address the ambiguity in segmentation, Arbelaez et al. proposed to collect human-labeled boundaries as ground truth and perform the segmentation in a supervised manner [8]. Following this trend, supervised methods usually emphasize estimating boundary probabilities rather than producing integrated regions [9], [10]. The drawback is that the detected boundaries may be discontinuous, and the disjointed edges impair visual perception when closed contours are preferred [11].

Unsupervised image segmentation is, of course, more challenging and demanding. As the general purpose of unsupervised image segmentation is to derive segments that are suitable for human perception, relying on perceptual rules from psychology is naturally one of the major directions. Perceptual rules have been carefully studied and are used in many unsupervised segmentation methods [7], [12], [13], [14], [15], [16]. The most widely used are the Gestalt principles [17]. Gestalt is a psychology term that means “unified whole”. It refers to the theory that describes how people tend to group visual elements when certain principles are fulfilled. It includes principles such as continuity, closure, similarity and proximity. However, these principles are difficult to quantify mathematically because they are abstract psychological concepts. In practice, only a few principles, such as similarity and proximity, are used in the literature, and they are interpreted in a simplified way. Under the similarity principle, the regions with the most similar appearances are considered for merging [7], [12], [13], [14], [15], [16]. In realizing the proximity principle, only neighboring regions are actually merged [6], [16], [18]. The perceptual rules need to be realized more thoroughly to further improve image segmentation.
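The simplified reading of the similarity and proximity principles described above can be illustrated with a minimal sketch. It is not taken from any of the cited methods: the function name `most_similar_adjacent_pair`, the mean-color representation and the toy values are all assumptions made for illustration. Proximity is reduced to "only adjacent regions may merge" and similarity to "smallest Euclidean distance between mean colors".

```python
def most_similar_adjacent_pair(mean_color, adjacency):
    """Pick the adjacent region pair with the smallest mean-color distance.

    mean_color: dict mapping a region id to its mean color tuple (hypothetical input).
    adjacency:  list of (a, b) region-id pairs that share a border.
    """
    best_pair, best_dist = None, float("inf")
    for a, b in adjacency:  # proximity: only neighboring pairs are candidates
        # similarity: Euclidean distance between the two mean colors
        dist = sum((x - y) ** 2 for x, y in zip(mean_color[a], mean_color[b])) ** 0.5
        if dist < best_dist:
            best_pair, best_dist = (a, b), dist
    return best_pair, best_dist


# Toy data: regions 0 and 3 look almost identical but do not touch each other.
mean_color = {0: (10, 10, 10), 1: (12, 11, 9), 2: (200, 180, 20), 3: (11, 10, 10)}
adjacency = [(0, 1), (1, 2), (2, 3)]
pair, dist = most_similar_adjacent_pair(mean_color, adjacency)
print(pair, round(dist, 3))
```

In this toy case the pair (0, 1) is selected even though regions 0 and 3 are closer in appearance, because 0 and 3 are not neighbors. This is the kind of restriction to neighboring pairs that the simplified interpretation imposes.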

In addition, it has long been recognized that there is a “semantic gap” between the segmented patches and the semantic entities that can be readily used. Both Malisiewicz and Efros [1] and Jianping et al. [6] stated that homogeneous segmented patches may not correspond to physical objects in the real world. The fundamental reason for this semantic gap lies in the current objectives of image segmentation, which focus on detecting precise boundaries [10] and producing homogeneous regions. As a result, any slight change in the image is captured, and objects that are segmented into several parts are considered acceptable. However, these parts need to be integrated to meet human expectations. It is therefore necessary to use global image context or to apply corresponding perceptual rules to piece together the segmented parts and form semantically consistent regions [4], [6].

In this paper, an unsupervised sub-scene segmentation method is proposed to narrow the semantic gap. The notion of the sub-scene is derived intuitively from human perception of a scene. When a person sees a scene, he or she may partition it into several sub-scenes, where each sub-scene fulfills a certain “function” and the meaning of the entire scene is probably derived by combining the functions of the sub-scenes. The notion of sub-scene used in this paper may appear similar to semantic segmentation such as [19], [20]. However, the major difference is that the sub-scene here is not confined to fixed categories set beforehand, and it does not need to go through a training step either. Several perceptual rules based on human psychology, such as proximity grouping, area of influence by objects and harmony, are explored and transformed into constraints that can be applied to low-level features. With a self-determined retrieval approach, sub-scenes can be generated automatically. The contributions of the proposed method are:

  1. Proximity grouping is formulated more appropriately using influence areas instead of being restricted to neighboring pairs;

  2. Balancing between proximity grouping and similarity grouping is achieved by a self-determined optimal retrieval strategy; and

  3. The unimportant details are ignored and a more integrated segmentation result is achieved.

The paper is organized as follows. In Section 2, the proposed method is presented in detail. Section 3 describes the experiments on three datasets, where each dataset emphasizes a different aspect of scenes. A comparison and discussion are given for each of them. Section 4 concludes the paper.

Section snippets

Problem formulation

Let $I$ be a given image. One way to partition the image into $M$ regions is $\Gamma_M(I) = \{R_M^i\},\ i = 1, \ldots, M$, where $R_M^i$ represents region $i$. The common split-and-merge approach to image segmentation is to first generate a number of superpixels and then either gradually merge them until a stopping criterion is satisfied, or complete the merging steps to the end as $\Gamma_M(I) \rightarrow \Gamma_{M-1}(I) \rightarrow \cdots \rightarrow \Gamma_1(I)$ and then select the optimal segmentation $\Gamma(I)$ from the entire process. The optimal $\Gamma(I)$ is the partition that minimizes the cost
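The merge-to-the-end variant described above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's method: the image is reduced to a 1-D row of superpixels, the merge rule is the simplest similarity-between-neighbors rule, and `placeholder_cost` (within-region squared error plus a per-region penalty) is a hypothetical stand-in, since the paper's actual cost is not given in this excerpt. The function names and the `penalty` parameter are likewise invented for the example.

```python
def merge_hierarchy(regions):
    """Greedily merge adjacent 1-D regions; return [Gamma_M, Gamma_{M-1}, ..., Gamma_1]."""
    current = [list(r) for r in regions]
    partitions = [[list(r) for r in current]]
    while len(current) > 1:
        means = [sum(r) / len(r) for r in current]
        # Merge the neighboring pair whose mean intensities are closest.
        i = min(range(len(current) - 1), key=lambda k: abs(means[k] - means[k + 1]))
        current = current[:i] + [current[i] + current[i + 1]] + current[i + 2:]
        partitions.append([list(r) for r in current])
    return partitions


def placeholder_cost(partition, penalty=50.0):
    """Hypothetical cost (NOT the paper's): squared error plus a penalty per region."""
    sse = 0.0
    for r in partition:
        mean = sum(r) / len(r)
        sse += sum((v - mean) ** 2 for v in r)
    return sse + penalty * len(partition)


def select_optimal(partitions, cost=placeholder_cost):
    """Return the partition in the merge sequence that minimizes the given cost."""
    return min(partitions, key=cost)


# Toy 1-D "superpixels": each inner list holds the pixel intensities of one region.
superpixels = [[10, 11], [12], [90, 92], [91], [50]]
hierarchy = merge_hierarchy(superpixels)
best = select_optimal(hierarchy)
print([len(p) for p in hierarchy])
print(best)
```

With these toy values the sequence runs from five regions down to one, and the placeholder cost selects the three-region partition. A different cost or penalty would select a different level of the hierarchy, which is why the choice of cost is the substantive part of any such method.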

Experiments and results

The experiments are conducted on three datasets in order to evaluate the proposed method thoroughly. The first dataset is the Berkeley segmentation dataset [29]. This dataset is widely used in the research community, which allows us to compare our results with other state-of-the-art methods. However, it may not be well suited to sub-scene evaluation because many scenes in the dataset are close-up shots. Thus, the indoor scene dataset [30] and the Stanford background dataset [31] have

Conclusions

In this paper, a new method of sub-scene segmentation is proposed. The sub-scene segments are meaningful entities that ignore unimportant details, in contrast to conventional segmentation results. The unsupervised sub-scene segmentation uses properties including proximity grouping, area of influence, similarity and harmony, which are explored based on psychological principles. These properties are formulated into constraints and a self-determined optimal retrieval is conducted to

References (35)

  • T. Malisiewicz, A.A. Efros, Improving spatial support for objects via multiple segmentations, in: British Machine...
  • S. Goferman, L. Zelnik-Manor, A. Tal, Context-aware saliency detection, in: IEEE Transactions on Pattern Analysis and...
  • J. Shotton, M. Johnson, R. Cipolla, Semantic texton forests for image categorization and segmentation, in: Computer...
  • C. Xi, A. Jain, A. Gupta, L.S. Davis, Piecing together the segmentation jigsaw using context, in: Computer Vision and...
  • L. Li-Jia, F.-F. Li, What, where and who? Classifying events by scene and object recognition, in: International...
  • F. Jianping, G. Yuli, L. Hangzai, R. Jain, Mining multilevel image semantics via hierarchical classification, in: IEEE...
  • B. Peng, L. Zhang, D. Zhang, A survey of graph theoretical approaches to image segmentation, in: Pattern Recognition,...
  • P. Arbelaez, Boundary extraction in natural images using ultrametric contour maps, in: Computer Vision and Pattern...
  • J. Mairal, M. Leordeanu, F. Bach, M. Hebert, J. Ponce, Discriminative sparse image models for class-specific edge...
  • P. Arbelaez, M. Maire, C. Fowlkes, J. Malik, Contour detection and hierarchical image segmentation, in: IEEE...
  • M. Yansheng, L. Hongdong, H. Xuming, Connected contours: a new contour completion model that respects the closure...
  • B. Peng, L. Zhang, D. Zhang, Automatic image segmentation by dynamic region merging, in: IEEE Transactions on Image...
  • G. Kootstra, D. Kragic, Fast and bottom-up object detection, segmentation, and evaluation using Gestalt principles, in:...
  • H. Yu, X. Zhang, S. Wang, B. Hou, Context-based hierarchical unequal merging for SAR image segmentation, in: IEEE...
  • A. Ion, J. Carreira, C. Sminchisescu, Image segmentation by figure-ground composition into maximal cliques, in:...
  • D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, in: IEEE Transactions on Pattern...
  • R. Kowalski et al., Psychology (2009)