Full Length Article
Deep interactive image segmentation based on region and boundary-click guidance

https://doi.org/10.1016/j.jvcir.2023.103797

Highlights

  • Region and boundary clicks are used as the user interaction strategy.

  • An interactive two-stream network structure is used to learn the region and boundary features of interest.

  • An information diffusion module is used to propagate the region- and boundary-click labels.

Abstract

In interactive image segmentation, the target object of interest is extracted under the guidance of user interactions. A main goal of this task is to reduce the user's interaction burden, i.e., to ensure satisfactory segmentation with as few interactions as possible. Thanks to the development of deep learning, neural network-based interactive approaches have significantly improved segmentation performance through powerful feature representations, so that only limited point (click) interaction is required from the user. This paper follows the deep learning-based interactive segmentation line and explores more efficient interaction strategies and more effective segmentation models. We further simplify user interaction to two clicks, where the first click selects the target region and the second determines the target boundary. Based on the region and boundary clicks, an interactive two-stream network structure is naturally derived to learn the region and boundary features of interest. Furthermore, we construct an information diffusion module to better propagate the region- and boundary-click labels, which helps to enhance the similarity within the region and the discrimination between boundaries. Extensive experiments on the popular GrabCut, Berkeley, DAVIS, MS COCO and SBD datasets verify the effectiveness of the proposed method.

Introduction

As a basic computer vision task, interactive image segmentation [10], [32], [34], [35], [1], [2], [3], [4], [5], [6], [7], [8], [15], [16], [17], [18] has been widely studied recently: users extract targets of interest through interactions such as bounding boxes [10], [29], [30] or scribbles [31], [32], [33]. Thanks to their strong feature representations, deep learning methods have significantly boosted the performance of the interactive segmentation task [1], [2], [3], [4], [5], [6], [7], [8], [15], [16], [17], [18]. This paper focuses on deep learning-based interactive segmentation.

Deep learning-based interactive methods require training an effective segmentation network on a large number of samples consisting of images and corresponding user interactions. Since collecting real user interactions for every instance in every image is time-consuming, early methods [1], [2], [3], [4], [5] automatically simulated user interactions in the form of pixel points (clicks) based on the ground-truth annotations. However, such sparse click-level information is easily 'forgotten' along the long path of a deep network over the whole image space, so these approaches require additional interactive corrections to improve the segmentation results. To address this problem, later methods [6], [7], [8] reduce the search space and increase the focus of the clicks by placing points at special locations of the target. For example, Maninis et al. [6] use four extreme points of an object (top, bottom, left-most, right-most pixels) to train an interactive segmentation network. Zhang et al. [8] develop an inside-outside guidance approach for object selection with three explicit points (top-left, bottom-right and center pixels of an object). Though these approaches achieve remarkable results, it is still laborious for the user to click at four extreme boundary positions [6] or two symmetrical corners [8] of an object. Building on these works, this paper further explores more efficient interaction strategies and reduces the number of clicks required per target to two (one for object region selection and the other for object boundary selection).
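
To make the simulated two-click training data concrete, the sketch below shows one plausible way to derive such clicks from a ground-truth mask. The sampling rules (farthest-interior point for the region click, a random contour pixel for the boundary click) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy import ndimage

def simulate_region_and_boundary_clicks(gt_mask, rng=None):
    """Simulate the two training clicks from a ground-truth binary mask.

    Returns (region_click, boundary_click) as (row, col) tuples. The
    sampling rules below are illustrative assumptions, not the exact
    procedure used in the paper.
    """
    rng = rng or np.random.default_rng()
    mask = gt_mask.astype(bool)

    # Region click: the interior pixel farthest from the object boundary,
    # a common proxy for the object "center".
    dist = ndimage.distance_transform_edt(mask)
    region_click = np.unravel_index(np.argmax(dist), dist.shape)

    # Boundary click: a random pixel on the object contour (mask pixels
    # that disappear after one erosion step).
    eroded = ndimage.binary_erosion(mask)
    contour = np.argwhere(mask & ~eroded)
    boundary_click = tuple(contour[rng.integers(len(contour))])

    return region_click, boundary_click
```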

In terms of network structure design, existing methods [1], [2], [3], [4], [5], [6], [7], [8] generally concatenate the original image with its corresponding click map and feed them into a fully convolutional network (FCN) model [23]. An FCN easily loses image spatial details when abstracting high-level semantic features, which often leads to inaccurate localization of object boundaries. In conventional (non-deep learning) methods, the segmentation model is generally built on both region and boundary information, where the former keeps the internal consistency of the object and the latter ensures the smoothness of the boundary. Drawing on this, double-branch ('region + edge') structures are widely used to overcome the boundary localization problem in many semantic segmentation approaches [9], [36]. Following these works, this paper explores a two-stream network structure for the interactive segmentation task, where the region-feature and edge-feature streams correspond to the region-selection and boundary-selection clicks, respectively.
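
The following PyTorch sketch illustrates the 'region + edge' two-stream idea in its simplest form: each stream takes the RGB image concatenated with its own click map, and the fused features predict the mask. The channel widths and backbone depth are placeholders, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class TwoStreamSegNet(nn.Module):
    """Schematic two-stream interactive segmentation network (a sketch,
    assuming 1-channel click maps; not the paper's architecture)."""

    def __init__(self, width=64):
        super().__init__()
        def stream():
            return nn.Sequential(
                nn.Conv2d(4, width, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            )
        self.region_stream = stream()   # driven by the region-click map
        self.edge_stream = stream()     # driven by the boundary-click map
        self.fuse = nn.Conv2d(2 * width, 1, 1)  # fused mask prediction

    def forward(self, image, region_map, edge_map):
        # Each stream sees the 3-channel image plus its own click map.
        r = self.region_stream(torch.cat([image, region_map], dim=1))
        e = self.edge_stream(torch.cat([image, edge_map], dim=1))
        return self.fuse(torch.cat([r, e], dim=1))  # segmentation logits
```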

On the other hand, attention mechanisms have been extensively studied in many computer vision tasks, including semantic segmentation [19], [27], [37], [38], where image context information can be effectively exploited to improve the segmentation result. However, as pointed out in [17], the attention strategy in semantic segmentation only responds to a predefined, fixed label space and is not directly applicable to the unfixed label space defined by the user in interactive segmentation. They [17] therefore adjust the affinity matrix in self-attention [19] based on prior information (the click map and a coarse segmentation) to adapt it to the two-class (foreground-background) segmentation task. For the proposed interactive two-stream structure, we combine the deep region and edge features to effectively propagate the labels of the region and boundary clicks.
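
As a rough illustration of affinity-based label diffusion, the sketch below spreads one-hot click seeds through a self-attention-style similarity matrix over deep features. It is a minimal sketch of the general mechanism; the paper's module also incorporates the coarse prediction, which is omitted here.

```python
import torch
import torch.nn.functional as F

def diffuse_click_labels(feats, click_idx, click_lbl):
    """Propagate sparse click labels through a feature-affinity matrix.

    feats:     (N, C) flattened deep features for N spatial positions.
    click_idx: (K,) long tensor of clicked positions.
    click_lbl: (K,) long tensor of labels (0 = boundary, 1 = region).
    """
    n, c = feats.shape
    # Dot-product affinities between all positions, row-normalized.
    affinity = F.softmax(feats @ feats.t() / c ** 0.5, dim=-1)  # (N, N)

    seeds = feats.new_zeros(n, 2)         # per-position label scores
    seeds[click_idx, click_lbl] = 1.0     # plant the click labels
    return affinity @ seeds               # diffused label scores, (N, 2)
```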

As described above, this paper designs a deep learning-based interactive image segmentation model. First, we simplify the user interaction to only two clicks: the first click is placed inside the object (e.g., near its center) to select the target region of interest, and the second click is placed on the edge of the object to select the target boundary of interest. It should be emphasized that the second boundary click is easy to place once the first region click is determined, which ensures the efficiency of the proposed interaction strategy. Furthermore, a circle can be naturally formed from these two clicks to completely enclose the target object, which effectively reduces the search space and increases the focus of the clicks at a higher resolution. In addition, as shown in Fig. 1, the proposed architecture is driven by the two specific region and boundary clicks, whereas existing architectures rely on continuous correction clicks (one click at a time). Consequently, our method performs only one round of segmentation per object in both training and testing, while existing approaches must execute multiple iterative rounds per object during training (multiple forward and backward passes) and testing (continuous interactive correction). Second, based on the above interaction strategy, an interactive two-stream segmentation network is naturally designed to learn the region features and boundary features of the object of interest separately; the interaction and combination of region and boundary information itself helps to promote the segmentation. Moreover, existing methods generally adopt an FCN-like single-branch structure to propagate the sparse user interaction forward, which easily loses spatial details when moving from high to low resolutions (e.g., in pooling steps); by explicitly estimating the object edge, the proposed architecture effectively compensates for this loss of spatial information. Last, existing methods generally use an ASPP-like structure to capture context, which responds to a fixed label space and cannot adapt to the user-defined label space of the interactive segmentation task. Unlike ASPP, our information diffusion module propagates the region-click and boundary-click labels and captures the context information of the target of interest, which helps to enhance the correlation within the region and the discrimination between boundaries.
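
The enclosing circle follows directly from the two clicks: the region click fixes the center and the boundary click fixes the radius. A minimal sketch of how the circle could be used to crop the region of interest is shown below; the enlargement margin is an illustrative assumption.

```python
import numpy as np

def crop_from_two_clicks(image, region_click, boundary_click, margin=1.2):
    """Crop the input around the circle implied by the two clicks.

    The region click is the circle center and the boundary click fixes the
    radius, so the circle encloses the object; the crop is the circle's
    bounding square enlarged by `margin` (an assumed value).
    """
    (cy, cx), (by, bx) = region_click, boundary_click
    radius = margin * np.hypot(by - cy, bx - cx)

    h, w = image.shape[:2]
    y0, y1 = max(0, int(cy - radius)), min(h, int(cy + radius) + 1)
    x0, x1 = max(0, int(cx - radius)), min(w, int(cx + radius) + 1)
    return image[y0:y1, x0:x1], (y0, x0)  # crop and its offset in the image
```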

Section snippets

Related works

User interaction types: While most methods [1], [2], [3], [4], [5] based on positive and negative clicks have become mainstream, many works have explored other forms of clicks. For example, DEXTR [6] places clicks at the four extreme positions of the target, from which the object can be cropped precisely. Since DEXTR is ambiguous for overlapping targets, IOG [8] leverages an inside point to specify the target area and two outside points to crop the region of interest. Additionally, because four extreme positions

User interaction mode

DEXTR [6] places the clicks at the four extreme positions of the target. IOG [8] explores one interior point to specify the object and two exterior points to crop the region of interest. Extreme points can provide the scale information of the target object and help eliminate ambiguities between objects. Based on the above methods, we further reduce user interaction to two points, where the first point is placed within the target region, such as the interior center of an object (for example the
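
Point-based methods in this family typically convert each click into a dense guidance map before feeding it to the network. The sketch below shows the common Gaussian-heatmap encoding; the sigma value is an assumption, and the paper may use a different encoding.

```python
import numpy as np

def click_to_heatmap(shape, click, sigma=10.0):
    """Encode a single click as a 2D Gaussian heatmap.

    Click-based methods (e.g. DEXTR, IOG) commonly feed such heatmaps to
    the network alongside the RGB image; sigma = 10 is a typical choice,
    assumed here rather than taken from the paper.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = click
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
```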

Experimental configurations

Datasets: The proposed method is evaluated on five common benchmarks: GrabCut [10], Berkeley [11], DAVIS [13], MS COCO [14] and SBD [12]. The color space of all datasets is RGB. The GrabCut dataset contains 50 images with multiple scales and is commonly used to evaluate interactive segmentation approaches. The Berkeley dataset contains 96 images of size 321×481 or 481×321, with 100 instance masks for testing. The DAVIS dataset is utilized for video object segmentation,
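
Benchmarks such as these are commonly scored with (mean) intersection-over-union between the predicted and ground-truth masks. The helper below shows the standard computation, independent of the paper's exact evaluation protocol.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Intersection-over-union between predicted and ground-truth masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    # Two empty masks agree perfectly by convention.
    return np.logical_and(pred, gt).sum() / union if union else 1.0
```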

Conclusion

In this paper, we design a novel interactive segmentation pipeline. We propose an effective interaction strategy based on region and boundary clicks that relieves the user's burden via the simulated enclosing circle. Corresponding to this interaction strategy, a two-stream segmentation network is naturally introduced to learn region and boundary features separately. In addition, an information diffusion module is designed to propagate the labels of the region and boundary clicks. Experiments show that our method is

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant 62172221 and in part by the Fundamental Research Funds for the Central Universities under Grant No. JSGP202204.

References (45)

  • Kevin McGuinness et al.

    A comparative evaluation of interactive segmentation algorithms

    Pattern Recognit.

    (2010)
  • Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S Huang, Deep interactive object selection, in: Proceedings...
  • Won-Dong Jang and Chang-Su Kim, Interactive image segmentation via backpropagating refinement scheme, in: Proceedings...
  • Konstantin Sofiiuk, Ilia Petrov, Olga Barinova, and Anton Konushin, f-BRS: Rethinking backpropagating refinement for...
  • Konstantin Sofiiuk, Ilia A Petrov, and Anton Konushin, Reviving iterative training with mask guidance for interactive...
  • Yuying Hao, Yi Liu, Zewu Wu, Lin Han, Yizhou Chen, Guowei Chen, Lutao Chu, Shiyu Tang, Zhiliang Yu, Zeyu Chen,...
  • Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool, Deep extreme cut: From extreme points to...
  • C. Dupont, Y. Ouakrim, and Q. C. Pham, UCP-net: Unstructured contour points for instance segmentation, in: 2021 IEEE...
  • Shiyin Zhang, Jun Hao Liew, Yunchao Wei, Shikui Wei, and Yao Zhao, Interactive object segmentation with inside-outside...
  • Towaki Takikawa, David Acuna, Varun Jampani, and Sanja Fidler, Gated-SCNN: Gated shape CNNs for semantic segmentation, in:...
  • C. Rother et al.

    "GrabCut": interactive foreground extraction using iterated graph cuts

    ACM Transactions on Graphics (TOG)

    (2004)
  • B. Hariharan et al.

    Semantic contours from inverse detectors

  • Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung, A...
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence...
  • Zheng Lin, Zhao Zhang, Lin-Zhuo Chen, Ming-Ming Cheng, and Shao-Ping Lu, Interactive image segmentation with first...
  • Marco Forte, Brian Price, Scott Cohen, Ning Xu, François Pitié, Getting to 99% accuracy in interactive segmentation,...
  • Xi Chen, Zhiyan Zhao, Feiwu Yu, Yilei Zhang, Manni Duan, Conditional diffusion for interactive segmentation, in:...
  • Xi Chen, Zhiyan Zhao, Yilei Zhang, Manni Duan, Donglian Qi, Hengshuang Zhao, FocalClick: Towards practical interactive...
  • Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He, Non-local neural networks, in: Proceedings of the IEEE...
  • Agrim Gupta, Piotr Dollar, and Ross Girshick, Lvis: A dataset for large vocabulary instance segmentation, in:...
  • Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun, Cascaded pyramid network for...
  • Jonathan Long, Evan Shelhamer, and Trevor Darrell, Fully convolutional networks for semantic segmentation, in:...