Neurocomputing

Volume 304, 23 August 2018, Pages 82-103

Methods and datasets on semantic segmentation: A review

https://doi.org/10.1016/j.neucom.2018.03.037

Abstract

Semantic segmentation, also called scene labeling, refers to the process of assigning a semantic label (e.g. car, people, road) to each pixel of an image. It is an essential data processing step for robots and other unmanned systems to understand the surrounding scene. Despite decades of effort, semantic segmentation remains a very challenging task due to the large variations in natural scenes. In this paper, we provide a systematic review of recent advances in this field. In particular, three categories of methods are reviewed and compared: those based on hand-engineered features, on learned features, and on weakly supervised learning. In addition, we describe a number of popular datasets aimed at facilitating the development of new segmentation algorithms. To demonstrate the advantages and disadvantages of different semantic segmentation models, we conduct a series of comparisons between them and discuss the results in depth. Finally, the review concludes with a discussion of future directions and challenges in this important field of research.

Introduction

With the ever-increasing range of intelligent applications (e.g. mobile robots), there is an urgent need for accurate scene understanding. As an essential step towards this goal, semantic segmentation has received significant attention in recent years. It refers to the process of assigning a semantic label (e.g. car, people) to each pixel of an image. One main challenge of this task is that natural scenes contain a large number of classes, some of which show a high degree of similarity in visual appearance.

The term “semantic segmentation” dates back to the 1970s [1]. At that time, it was equivalent to image segmentation but emphasized that the segmented regions must be “semantically meaningful”. In the 1990s, “object segmentation and recognition” [2] further distinguished semantic objects of all classes from the background, and can be viewed as a two-class image segmentation problem. As the complete partition of foreground objects from the background is very challenging, a relaxed two-class formulation, sliding-window object detection [3], was proposed to localize objects with bounding boxes. Excellent two-class image segmentation algorithms such as constrained parametric min-cuts (CPMC) [4] are useful for finding where the objects in a scene are. However, two-class image segmentation cannot tell what the segmented objects are. As a result, generic object recognition (or detection) was gradually extended to multi-class image labeling [5], i.e., semantic segmentation in the present sense, which tells both where the objects in the scene are and what they are.

In order to achieve high-quality semantic segmentation, two questions are of common concern: how to design efficient feature representations that differentiate objects of various classes, and how to exploit contextual information to ensure consistency between the labels of pixels. For the first question, most early methods [6], [7], [8] benefit from hand-engineered features such as the Scale Invariant Feature Transform (SIFT) [9] and Histograms of Oriented Gradients (HOG) [10]. With the development of deep learning [11], [12], the use of learned features in computer vision tasks such as image classification [13], [14] has achieved great success in the past few years. As a result, the semantic segmentation community has recently paid much attention to learned features [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], which usually refer to Convolutional Neural Networks (CNNs or ConvNets) [27]. For the second question, the most common strategy, regardless of the features used, is to employ contextual models such as Markov Random Fields (MRFs) [28] and Conditional Random Fields (CRFs) [6], [8], [15], [16], [20], [29], [30], [31], [32], [33], [34]. These graphical models make it easy to leverage a variety of relationships between classes by setting links between adjacent pixels. More recently, Recurrent Neural Networks (RNNs) [35], [36] have become a common means of capturing contextual information. Under the weakly supervised framework [28], [37], [38], [39], [40], [41], another challenging issue is how to learn class models from weakly annotated images, whose labels are provided at image level rather than pixel level. To address this challenge, many methods resort to multiple instance learning (MIL) techniques [42].
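To make the role of these contextual models concrete, most of the CRF-based methods cited above minimize an energy of the following standard form (a generic textbook formulation, not the exact objective of any single paper cited here):

$$E(\mathbf{x}) = \sum_{i} \psi_u(x_i) + \sum_{(i,j) \in \mathcal{E}} \psi_p(x_i, x_j)$$

where $\psi_u(x_i)$ is the unary potential, typically the negative log-probability that pixel $i$ takes label $x_i$ under a local classifier built on the features discussed above, and $\psi_p(x_i, x_j)$ is a pairwise potential that penalizes linked pixels $(i, j)$ receiving different labels, usually weighted by their color or feature similarity. The predicted labeling is the minimizer of $E(\mathbf{x})$; as discussed below, exact minimization is generally intractable, which is why approximate inference is needed.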

Although many strategies are available for addressing the problems mentioned above, they are not yet mature. For example, there are still no universally accepted hand-engineered features, while research on learned features has returned to the spotlight only in the past few years. Inference in MRFs or CRFs is a very challenging problem in itself and often resorts to approximation algorithms. Thus, new and creative semantic segmentation methods continue to be developed and reported.

The main motivation of this paper is to provide a comprehensive survey of semantic segmentation methods, focusing on the commonly encountered problems and the corresponding strategies adopted. Semantic segmentation is now a vast field closely related to other computer vision tasks, and this review cannot fully cover the entire field. Since excellent reviews of traditional image segmentation, object segmentation and object detection already exist [43], [44], we do not cover those subjects. We focus instead on generic semantic segmentation, i.e., multi-class segmentation. Based on the observation that most works published after 2012 are CNN based, we divide existing semantic segmentation methods into those based on hand-engineered features and those based on learned features (see Fig. 1). We discuss weakly supervised methods separately because this challenging line of work is being investigated actively. It should be emphasized that there are no clear boundaries between these three categories. We further divide each category into several sub-categories and then analyze their motivations and principles.

The rest of this paper is organized as follows. Before introducing recent progress on semantic segmentation, preliminaries of commonly used theories are given in the next section. Methods using hand-engineered features and learned features are systematically reviewed in Sections 3 and 4, respectively. The efforts devoted to weakly supervised semantic segmentation are described in Section 5. In Section 6, we describe several popular datasets for semantic segmentation tasks. Section 7 compares some representative methods using several common evaluation criteria. Finally, we conclude the paper in Section 8 with our views on future perspectives. Note that semantic segmentation is also called scene labeling in the literature; we do not differentiate between the two terms in the rest of this paper.

Section snippets

Preliminaries

We start by describing the commonly used theories and technologies in the semantic segmentation community, including superpixels and contextual models.
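As a concrete illustration of the superpixel idea, the sketch below over-segments an image with SLIC. It assumes the scikit-image library, which is only one of several superpixel implementations (alongside, e.g., TurboPixels, mean shift, or normalized cuts) covered in this section:

```python
# A minimal superpixel sketch using SLIC from scikit-image (an assumption:
# the reviewed methods use a variety of algorithms, not this one library).
import numpy as np
from skimage import data
from skimage.segmentation import slic

image = data.astronaut()  # any RGB image as an (H, W, 3) ndarray
segments = slic(image, n_segments=200, compactness=10, start_label=0)

# Each pixel now carries a superpixel id, so features can be pooled per
# region and the labeling problem shrinks from ~260k pixels to ~200 regions.
for sp_id in np.unique(segments):
    mask = segments == sp_id
    mean_color = image[mask].mean(axis=0)  # a simple per-region feature
```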

Hand-engineered features based scene labeling methods

In this section, we focus on reviewing scene labeling methods that rely on hand-engineered features such as SIFT [9], HOG [10] and spin images [64]. Note that we do not discuss these local features in detail owing to page limitations; we refer readers to a comprehensive review [65]. In addition, it should be emphasized that the “visual words” described in this section may partially belong to learned features because, in most cases, they are learned from well-designed hand-engineered features. …

Learned features based scene labeling methods

In this section, we review representative scene labeling methods based on learned features, which usually refer to convolutional neural networks (CNNs). Typical CNNs take fixed-size inputs and produce a single prediction for whole-image classification. However, scene labeling aims to assign a class label to each pixel of arbitrarily sized images. In other words, when applying typical CNNs to scene labeling, the main problem is how to produce pixel-dense predictions. …
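The fully convolutional idea reviewed in this section can be sketched in a few lines. The toy network below is a hedged illustration assuming PyTorch, with arbitrary layer sizes rather than those of any reviewed model; it replaces fully connected layers with a 1 × 1 convolution and upsamples the coarse score map back to the input resolution, yielding one prediction per pixel for inputs of any size:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Minimal fully convolutional sketch: no fixed input size,
    one score vector per pixel (channel dimension = num_classes)."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.features = nn.Sequential(  # downsamples by a factor of 4
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Conv2d(64, num_classes, 1)  # 1x1 "fc" layer

    def forward(self, x):
        h, w = x.shape[-2:]
        scores = self.classifier(self.features(x))  # coarse score map
        # bilinear upsampling restores the original spatial resolution
        return F.interpolate(scores, size=(h, w),
                             mode="bilinear", align_corners=False)

logits = TinyFCN()(torch.randn(1, 3, 240, 320))  # arbitrary-sized input
assert logits.shape == (1, 21, 240, 320)         # dense, per-pixel scores
```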

Weakly and semi-supervised scene labeling methods

The methods mentioned so far require a large number of densely annotated (one label per pixel) images, either for training or for providing candidate labels to be transferred in non-parametric frameworks. It is well known that such an annotation task can be very tedious and time-consuming. This has given rise to a more challenging research field, namely weakly supervised scene labeling, where the ground truth consists only of image-level labels or bounding boxes.
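To illustrate how the MIL techniques discussed in this section couple dense predictions with image-level tags, the sketch below (a simplified illustration assuming PyTorch; it is not the training objective of any specific cited method) reduces each per-class score map to its most confident pixel via global max pooling, so that only multi-hot image-level labels are needed as supervision:

```python
import torch
import torch.nn.functional as F

def mil_loss(pixel_logits, image_labels):
    """pixel_logits: (N, C, H, W) per-pixel scores from any dense model.
    image_labels: (N, C) float multi-hot image-level tags, the only
    supervision available in the weakly supervised setting."""
    # Global max pooling: each class is represented by its best pixel,
    # in the spirit of multiple instance learning.
    image_logits = pixel_logits.amax(dim=(2, 3))  # (N, C)
    return F.binary_cross_entropy_with_logits(image_logits, image_labels)

# usage sketch: loss = mil_loss(model(images), tags); loss.backward()
```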

Public datasets for scene labeling

To inspire new methods and facilitate comparison between different approaches, many public datasets for scene labeling have been proposed (see Table 4). Sowerby [150] is one of the earliest datasets; it mainly includes a variety of urban and rural scenes at a small image size (96 × 64). MSRC [29] is another early dataset and consists of 591 photographs in 23 classes.

It is well known that a larger number of training images is more useful for learning models of object categories. …

Evaluation of scene labeling methods

Image segmentation is well known to be an ill-defined problem, so how to evaluate the quality of an algorithm has always been a critical issue. For semantic segmentation, evaluation is usually based on comparing the algorithm’s output with the ground truth. Examples include pixel accuracy, class accuracy, the Jaccard index, precision and recall.

Pixel Accuracy (P-ACC) and Class-average Accuracy (C-ACC)

P-ACC is the most widely used evaluation criterion in scene labeling. …
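All of the criteria named above can be computed from a single confusion matrix accumulated over a test set. The sketch below (NumPy; the function and variable names are ours, chosen for illustration) computes P-ACC, C-ACC and the mean Jaccard index (mean IoU):

```python
import numpy as np

def scene_labeling_metrics(conf):
    """conf: (C, C) confusion matrix where conf[i, j] counts pixels of
    ground-truth class i predicted as class j over the whole test set.
    Classes absent from the test set should be masked before averaging."""
    tp = np.diag(conf).astype(float)                 # correctly labeled pixels
    p_acc = tp.sum() / conf.sum()                    # pixel accuracy (P-ACC)
    c_acc = (tp / conf.sum(axis=1)).mean()           # class-average accuracy
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)  # per-class Jaccard
    return p_acc, c_acc, iou.mean()                  # mean IoU
```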

Conclusion and future directions

In this paper, we critically reviewed existing scene labeling methods. The simplest way to tackle this challenging problem is to perform pixel-wise classification using hand-engineered features such as color and texture. However, this line of work easily produces inconsistent results. Performing superpixel-wise classification mitigates this problem to some extent: by assigning the same label to all pixels in a superpixel, spatial consistency is ensured, though only within a local range. …

Acknowledgment

This work has been supported by the National Natural Science Foundation of China (Grant No. 61573135), the National Key Technology Support Program (Grant No. 2015BAF11B01), the National Key Scientific Instrument and Equipment Development Project of China (Grant No. 2013YQ140517), the Hunan Key Laboratory of Intelligent Robot Technology in Electronic Manufacturing (Grant No. 2018001), the Science and Technology Plan Project of Shenzhen City (JCYJ20170306141557198), and the Key Project of the Science and Technology Plan of …

References (180)

  • S. Gupta et al., Perceptual organization and recognition of indoor scenes from RGB-D images, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013).
  • S.H. Khan et al., Geometry driven semantic labeling of indoor scenes, Proceedings of the European Conference on Computer Vision (2014).
  • D.G. Lowe, Object recognition from local scale-invariant features, Proceedings of the IEEE International Conference on Computer Vision (1999).
  • N. Dalal et al., Histograms of oriented gradients for human detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2005).
  • G.E. Hinton et al., A fast learning algorithm for deep belief nets, Neural Comput. (2006).
  • Y. LeCun et al., Deep learning, Nature (2015).
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Proceedings of Advances in Neural Information Processing Systems (2012).
  • K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, ...
  • C. Farabet et al., Learning hierarchical features for scene labeling, IEEE Trans. Pattern Anal. Mach. Intell. (2013).
  • L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Semantic image segmentation with deep convolutional...
  • R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014).
  • D. Eigen et al., Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, Proceedings of the IEEE International Conference on Computer Vision (2015).
  • J. Long et al., Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).
  • C. Couprie, C. Farabet, L. Najman, Y. LeCun, Indoor semantic segmentation using depth information, ...
  • S. Gupta et al., Learning rich features from RGB-D images for object detection and segmentation, Proceedings of the European Conference on Computer Vision (2014).
  • H. Noh et al., Learning deconvolution network for semantic segmentation, Proceedings of the IEEE International Conference on Computer Vision (2015).
  • J.M. Alvarez et al., Semantic road segmentation via multi-scale ensembles of learned features, Proceedings of the European Conference on Computer Vision (2012).
  • G. Bertasius et al., Semantic segmentation with boundary neural fields, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016).
  • H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, ...
  • C. Peng, X. Zhang, G. Yu, G. Luo, J. Sun, Large kernel matters–improve semantic segmentation by global convolutional...
  • Y. LeCun et al., Gradient-based learning applied to document recognition, Proc. IEEE (1998).
  • J. Verbeek et al., Region classification with Markov field aspect models, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2007).
  • J. Shotton et al., TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation, Proceedings of the European Conference on Computer Vision (2006).
  • P. Kohli et al., Robust higher order potentials for enforcing label consistency, Int. J. Comput. Vis. (2009).
  • L. Ladický et al., Associative hierarchical CRFs for object class image segmentation, Proceedings of the IEEE International Conference on Computer Vision (2009).
  • S. Gould et al., Decomposing a scene into geometric and semantically consistent regions, Proceedings of the IEEE International Conference on Computer Vision (2009).
  • X. Ren et al., RGB-D scene labeling: features and algorithms, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2012).
  • X. He et al., Multiscale conditional random fields for image labeling, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2004).
  • F. Visin et al., ReSeg: a recurrent neural network-based model for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2016).
  • B. Shuai et al., DAG-recurrent neural networks for scene labeling, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016).
  • A. Vezhnevets et al., Towards weakly supervised semantic segmentation by means of multiple instance and multitask learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2010).
  • A. Vezhnevets et al., Weakly supervised semantic segmentation with a multi-image model, Proceedings of the IEEE International Conference on Computer Vision (2011).
  • G. Papandreou et al., Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation, Proceedings of the IEEE International Conference on Computer Vision (2015).
  • D. Pathak, E. Shelhamer, J. Long, T. Darrell, Fully convolutional multi-class multiple instance learning, arXiv...
  • L. Zhang et al., A probabilistic associative model for segmenting weakly supervised images, IEEE Trans. Image Process. (2014).
  • D.K. Prasad, Survey of the problem of object detection in real images, Int. J. Image Process. (2012).
  • X. Ren et al., Learning a classification model for segmentation, Proceedings of the IEEE International Conference on Computer Vision (2003).
  • D. Comaniciu et al., Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. (2002).
  • J. Shi et al., Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (2000).
  • A. Levinshtein et al., TurboPixels: fast superpixels using geometric flows, IEEE Trans. Pattern Anal. Mach. Intell. (2009).

Hongshan Yu received the B.S., M.S. and Ph.D. degrees in Control Science and Technology from the College of Electrical and Information Engineering of Hunan University, Changsha, China, in 2001, 2004 and 2007, respectively. From 2011 to 2012, he worked as a postdoctoral researcher in the Laboratory for Computational Neuroscience at the University of Pittsburgh, USA. He is currently an associate professor at Hunan University and associate dean of the National Engineering Laboratory for Robot Visual Perception and Control. His research interests include autonomous mobile robots and machine vision.

Zhengeng Yang received the B.S. and M.S. degrees from Central South University, Changsha, China, in 2009 and 2012, respectively. He is currently a Ph.D. candidate at Hunan University, Changsha, China. His research interests include computer vision, image analysis and machine learning, with a particular focus on semantic segmentation.

Lei Tan received the B.S. and M.S. degrees in Control Science and Technology from the College of Electrical and Information Engineering of Hunan University, Changsha, China, in 2007 and 2010, respectively. From 2013 to 2015, he visited the Robotics Institute at Carnegie Mellon University, USA, as a joint-training Ph.D. student sponsored by the China Scholarship Council. His research interests mainly focus on computer vision and mobile robots.

    Yaonan Wang received the B.S. degree in computer engineering from East China Technology Institute in 1981, and the M.S. and Ph.D. degrees in electrical engineering from Hunan University, Changsha, China in 1990 and 1994 respectively. From 1994 to 1995, he was a Postdoctoral Research Fellow with the National University of Defense Technology. He is currently a Professor at Hunan University. His research interests are image processing, pattern recognition, and robot control.

Wei Sun received the M.S. and Ph.D. degrees in Control Science and Technology from Hunan University, Changsha, China, in 1999 and 2002, respectively. He is currently a Professor at Hunan University. His research interests include artificial intelligence, robot control, complex mechanical and electrical control systems, and automotive electronics.

    Mingui Sun received the B.S. degree in instrumental and industrial automation from Shenyang Chemical Engineering Institute, Shenyang, China, in 1982, and the M.S. and Ph.D. degrees in electrical engineering from the University of Pittsburgh, Pittsburgh, PA, USA, in 1986 and 1989, respectively. He is currently a Professor of neurosurgery, electrical and computer engineering, and bioengineering. His current research interests include advanced biomedical electronic devices, biomedical signal and image processing, sensors and transducers, artificial neural networks.

    Yandong Tang received the B.S. and M.S. degrees in mathematics from Shandong University, China, in 1984 and 1987, respectively. He received the Ph.D. degree in applied mathematics from the University of Bremen, Germany, in 2002. He is a Professor at the Shenyang Institute of Automation, Chinese Academy of Sciences. His research interests include numerical computation, image processing, and computer vision.
