Methods and datasets on semantic segmentation: A review
Introduction
With the ever-increasing range of intelligent applications (e.g., mobile robots), there is an urgent need for accurate scene understanding. As an essential step toward this goal, semantic segmentation has received significant attention in recent years. It refers to the process of assigning a semantic label (e.g., car, person) to each pixel of an image. One main challenge of this task is that natural scenes contain a large number of classes, some of which show a high degree of similarity in visual appearance.
The term “semantic segmentation” can be dated back to the 1970s [1]. At that time, it was equivalent to image segmentation but emphasized that the segmented regions must be “semantically meaningful”. In the 1990s, “object segmentation and recognition” [2] further distinguished semantic objects of all classes from the background and can be viewed as a two-class image segmentation problem. As the complete partition of foreground objects from the background is very challenging, a relaxed two-class segmentation problem, sliding-window object detection [3], was proposed to localize objects with bounding boxes. Excellent two-class segmentation algorithms such as constrained parametric min-cuts (CPMC) [4] are useful for finding where the objects in a scene are. However, two-class segmentation cannot tell what the segmented objects are. As a result, the generic sense of object recognition (or detection) was gradually extended to multi-class image labeling [5], i.e., semantic segmentation in the present sense, which tells both where and what the objects in the scene are.
In order to achieve high-quality semantic segmentation, two questions are of common concern: how to design efficient feature representations to differentiate objects of various classes, and how to exploit contextual information to ensure consistency between the labels of pixels. For the first question, most early methods [6], [7], [8] benefited from hand-engineered features, such as the Scale-Invariant Feature Transform (SIFT) [9] and Histograms of Oriented Gradients (HOG) [10]. With the development of deep learning [11], [12], the use of learned features in computer vision tasks such as image classification [13], [14] has achieved great success in the past few years. As a result, the semantic segmentation community has recently paid considerable attention to learned features [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], which usually refers to Convolutional Neural Networks (CNNs or ConvNets) [27]. For the second question, the most common strategy, regardless of the features used, is to employ contextual models such as Markov Random Fields (MRF) [28] and Conditional Random Fields (CRF) [6], [8], [15], [16], [20], [29], [30], [31], [32], [33], [34]. These graphical models make it easy to leverage a variety of relationships between classes by setting links between adjacent pixels. More recently, the use of Recurrent Neural Networks (RNNs) [35], [36] has become increasingly common for retrieving contextual information. Under the weakly supervised framework [28], [37], [38], [39], [40], [41], another challenging issue is how to learn class models from weakly annotated images, whose labels are provided at the image level rather than the pixel level. To address this challenge, many methods resort to multiple instance learning (MIL) techniques [42].
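The graphical contextual models mentioned above are typically written as an energy to be minimized. As an illustrative sketch (the exact potentials vary across the cited works; this is the common contrast-sensitive Potts form, with notation assumed here):

```latex
E(\mathbf{y}) = \sum_{i} \psi_i(y_i) + \lambda \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j),
\qquad
\psi_{ij}(y_i, y_j) = [\,y_i \neq y_j\,]\, \exp\!\big(-\beta \,\lVert I_i - I_j \rVert^2\big)
```

Here \(\psi_i\) is the unary cost of assigning label \(y_i\) to pixel \(i\) (typically the negative log-score of a per-pixel classifier), \(\mathcal{E}\) is the set of adjacent pixel pairs, and the pairwise term discourages label changes except across strong image edges, which is what enforces label consistency between neighboring pixels.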
Although many strategies are available for addressing the problems mentioned above, they are not yet mature. For example, there are still no universally accepted hand-engineered features, while research on learned features has become a focus again only in the last few years. The inference of MRFs or CRFs is a very challenging issue in itself and often resorts to approximation algorithms. Thus, new and creative semantic segmentation methods are being developed and reported continuously.
The main motivation of this paper is to provide a comprehensive survey of semantic segmentation methods, focusing on the commonly concerned problems and the corresponding strategies adopted. Semantic segmentation is now a vast field closely related to other computer vision tasks, and this review cannot fully cover the entire field. Since excellent reviews on traditional image segmentation, object segmentation and object detection already exist [43], [44], we will not cover these subjects. We will instead focus on generic semantic segmentation, i.e., multi-class segmentation. Based on the observation that most works published after 2012 are CNN based, we divide existing semantic segmentation methods into those based on hand-engineered features and those based on learned features (see Fig. 1). We discuss weakly supervised methods separately because this challenging line of methods is being investigated actively. It should be emphasized that there are no clear boundaries between these three categories. For each category, we further divide it into several sub-categories and then analyze their motivations and principles.
The rest of this paper is organized as follows. Before introducing recent progress on semantic segmentation, preliminaries of commonly used theories are given in the next section. Methods using hand-engineered features and learned features are systematically reviewed in Sections 3 and 4, respectively. The efforts devoted to weakly supervised semantic segmentation are described in Section 5. In Section 6, we describe several popular datasets for semantic segmentation tasks. Section 7 compares some representative methods using several common evaluation criteria. Finally, we conclude the paper in Section 8 with our views on future perspectives. Note that semantic segmentation is also called scene labeling in the literature; we will not differentiate between the two terms in the rest of this paper.
Preliminaries
We start by describing the commonly used theories and technologies in the semantic segmentation community, including superpixels and contextual models.
Hand-engineered features based scene labeling methods
In this section, we focus on reviewing scene labeling methods that rely on hand-engineered features, such as SIFT [9], HOG [10] and spin images [64]. Note that we will not discuss local features in detail owing to page limitations; we refer readers to a comprehensive review [65]. In addition, it should be emphasized that the “visual words” described in this section may partially belong to learned features because, in most cases, they are learned from well-designed hand-engineered features. We
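The core idea behind descriptors such as HOG can be conveyed in a few lines: each image cell is summarized by a magnitude-weighted histogram of gradient orientations. The following is a minimal sketch, not a full HOG implementation (no blocks, no overlapping-cell normalization), and all names are our own:

```python
import numpy as np

def cell_orientation_histogram(patch, n_bins=9):
    """Unsigned gradient-orientation histogram for one cell (the core HOG idea)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # fold into [0, 180) degrees
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-6)    # L2 normalization

# A vertical step edge puts all histogram mass in the horizontal-gradient bin.
patch = np.tile([0, 0, 0, 0, 255, 255, 255, 255], (8, 1))
h = cell_orientation_histogram(patch)
```

In a full descriptor, such cell histograms are concatenated over blocks and re-normalized, which is what gives HOG its robustness to illumination changes.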
Learned features based scene labeling methods
In this section, we review representative scene labeling methods based on learned features, which usually refer to convolutional neural networks (CNNs). Typical CNNs take fixed-size inputs and produce a single prediction for whole-image classification. However, scene labeling aims to assign a class label to each pixel of arbitrarily sized images. In other words, when applying typical CNNs to scene labeling, the main problem is how to produce pixel-dense predictions. To this end, we first
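The pixel-dense prediction problem can be illustrated with a toy sketch: a CNN with stride produces a coarse class-score map, which must be upsampled back to image resolution before taking a per-pixel argmax. Fully convolutional networks learn this upsampling (bilinear-initialized deconvolution); the nearest-neighbor version below is only an illustration of the resolution gap, with all names our own:

```python
import numpy as np

def dense_labels(coarse_scores, stride):
    """Upsample a coarse class-score map of shape (H, W, C) by `stride` and
    take the per-pixel argmax, yielding an (H*stride, W*stride) label map."""
    up = np.kron(coarse_scores, np.ones((stride, stride, 1)))  # nearest-neighbor
    return up.argmax(axis=-1)

# Toy example: a 2x2 score map over 3 classes, upsampled by a factor of 4.
coarse = np.zeros((2, 2, 3))
coarse[0, 0, 1] = coarse[0, 1, 2] = 1.0  # top row scores classes 1 and 2
labels = dense_labels(coarse, stride=4)  # an 8x8 label map
```

The blocky output of this sketch is exactly why learned upsampling, skip connections and CRF refinement are needed to recover sharp object boundaries.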
Weakly and semi-supervised scene labeling methods
The methods mentioned so far require a large number of densely annotated (one label per pixel) images, either for training or for providing candidate labels for transfer in non-parametric frameworks. It is well known that such annotation can be very tedious and time-consuming. This has motivated a more challenging research field, namely weakly supervised scene labeling, where the ground truth is given only by image-level labels or bounding boxes.
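MIL treats an image as a bag of pixels: if a class is present at the image level, at least one pixel must belong to it. A common aggregation trick in MIL-based training is to score an image for a class by its single most confident pixel (global max pooling), so that only image-level labels are needed. A minimal numpy sketch of this idea, with names and the logistic loss chosen by us for illustration:

```python
import numpy as np

def image_level_scores(pixel_scores):
    """MIL aggregation: the image-level score of each class is the score of
    its most confident pixel. (H, W, C) -> (C,)."""
    return pixel_scores.max(axis=(0, 1))

def mil_loss(pixel_scores, image_labels):
    """Multi-label logistic loss on the aggregated scores; `image_labels` is
    the binary vector of classes present in the image (the weak annotation)."""
    p = 1.0 / (1.0 + np.exp(-image_level_scores(pixel_scores)))
    eps = 1e-12
    return -np.mean(image_labels * np.log(p + eps)
                    + (1 - image_labels) * np.log(1 - p + eps))

# One confident class-0 pixel satisfies the image-level label [1, 0].
scores = np.zeros((4, 4, 2))
scores[1, 2, 0] = 5.0    # a single strong class-0 response
scores[:, :, 1] = -5.0   # class 1 suppressed everywhere
loss = mil_loss(scores, np.array([1.0, 0.0]))
```

Because the gradient flows only through the maximal pixel, such training tends to localize discriminative object parts rather than whole objects, which is one reason many weakly supervised methods add smoothing or expansion priors.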
Public datasets for scene labeling
To inspire new methods and facilitate comparison between different approaches, many public datasets for scene labeling have been proposed (see Table 4). Sowerby [150] is one of the earliest datasets; it mainly includes a variety of urban and rural scenes at small sizes (96 × 64). MSRC [29] is another early dataset and consists of 591 photographs in 23 classes.
It is well known that a large number of training images are more useful for learning models of object categories. Motivated by the
Evaluation of scene labeling methods
It is well known that image segmentation is an ill-defined problem; how to evaluate the quality of an algorithm has always been a critical issue. For semantic segmentation, evaluation is usually based on comparing the algorithm’s output with the ground truth. Examples include pixel accuracy, class accuracy, the Jaccard index, precision and recall.
Pixel Accuracy (P-ACC) and Class-average Accuracy (C-ACC)
P-ACC is the most widely used evaluation criterion in scene labeling. It
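All of the criteria named in this section can be computed from a single confusion matrix accumulated over the test set. A minimal sketch (variable names are our own):

```python
import numpy as np

def confusion_matrix(gt, pred, n_classes):
    """conf[i, j] counts pixels of ground-truth class i predicted as class j."""
    idx = gt.ravel() * n_classes + pred.ravel()
    return np.bincount(idx, minlength=n_classes**2).reshape(n_classes, n_classes)

def scene_labeling_metrics(gt, pred, n_classes):
    conf = confusion_matrix(gt, pred, n_classes)
    tp = np.diag(conf).astype(float)
    p_acc = tp.sum() / conf.sum()                     # pixel accuracy
    c_acc = np.mean(tp / np.maximum(conf.sum(1), 1))  # class-average accuracy
    iou = tp / np.maximum(conf.sum(1) + conf.sum(0) - tp, 1)  # per-class Jaccard
    return p_acc, c_acc, iou.mean()

# Toy 2x2 labeling with one mislabeled pixel.
gt   = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
p_acc, c_acc, miou = scene_labeling_metrics(gt, pred, n_classes=2)
```

Note the difference the averaging makes: P-ACC is dominated by frequent classes (e.g., sky, road), whereas C-ACC and mean Jaccard weight rare classes equally, which is why published comparisons usually report more than one of them.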
Conclusion and future directions
In this paper, we critically reviewed existing scene labeling methods. The simplest way to approach this challenging task is to perform pixel-wise classification using hand-engineered features, such as color and texture. However, this line of work easily produces inconsistent results. Performing superpixel-wise classification can mitigate this problem to some extent. By assigning the same label to every pixel of a superpixel, spatial consistency is ensured, but only within a local range. A more
Acknowledgment
This work has been supported by the National Natural Science Foundation of China (Grant No. 61573135), the National Key Technology Support Program (Grant No. 2015BAF11B01), the National Key Scientific Instrument and Equipment Development Project of China (Grant No. 2013YQ140517), the Hunan Key Laboratory of Intelligent Robot Technology in Electronic Manufacturing (Grant No. 2018001), the Science and Technology Plan Project of Shenzhen City (JCYJ20170306141557198), the Key Project of Science and Technology Plan of
References (180)
Multiple instance classification: review, taxonomy and comparative study, Artif. Intell., 2013.
Beyond pixels: a comprehensive survey from bottom-up to semantic image segmentation and cosegmentation, J. Vis. Commun. Image Represent., 2016.
A survey of graph theoretical approaches to image segmentation, Pattern Recognit., 2013.
A comprehensive review of current local features for computer vision, Neurocomputing, 2008.
An analysis system for scenes containing objects with substructures, Proceedings of the Fourth International Joint Conference on Pattern Recognition, 1978.
Integrating visual cues for object segmentation and recognition, Opt. News, 1989.
Rapid object detection using a boosted cascade of simple features, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001.
CPMC: automatic object segmentation using constrained parametric min-cuts, IEEE Trans. Pattern Anal. Mach. Intell., 2012.
A statistical model for general contextual object recognition, Proceedings of the European Conference on Computer Vision, 2004.
Indoor scene segmentation using a structured light sensor, Proceedings of the IEEE International Conference on Computer Vision Workshops, 2011.
Perceptual organization and recognition of indoor scenes from RGB-D images, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Geometry driven semantic labeling of indoor scenes, Proceedings of the European Conference on Computer Vision.
Object recognition from local scale-invariant features, Proceedings of the IEEE International Conference on Computer Vision.
Histograms of oriented gradients for human detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
A fast learning algorithm for deep belief nets, Neural Comput.
Deep learning, Nature.
ImageNet classification with deep convolutional neural networks, Proceedings of the Advances in Neural Information Processing Systems.
Learning hierarchical features for scene labeling, IEEE Trans. Pattern Anal. Mach. Intell.
Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, Proceedings of the IEEE International Conference on Computer Vision.
Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Learning rich features from RGB-D images for object detection and segmentation, Proceedings of the European Conference on Computer Vision.
Learning deconvolution network for semantic segmentation, Proceedings of the IEEE International Conference on Computer Vision.
Semantic road segmentation via multi-scale ensembles of learned features, Proceedings of the European Conference on Computer Vision.
Semantic segmentation with boundary neural fields, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Gradient-based learning applied to document recognition, Proc. IEEE.
Region classification with Markov field aspect models, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation, Proceedings of the European Conference on Computer Vision.
Robust higher order potentials for enforcing label consistency, Int. J. Comput. Vis.
Associative hierarchical CRFs for object class image segmentation, Proceedings of the IEEE International Conference on Computer Vision.
Decomposing a scene into geometric and semantically consistent regions, Proceedings of the IEEE International Conference on Computer Vision.
RGB-D scene labeling: features and algorithms, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Multiscale conditional random fields for image labeling, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
ReSeg: a recurrent neural network-based model for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
DAG-recurrent neural networks for scene labeling, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Towards weakly supervised semantic segmentation by means of multiple instance and multitask learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Weakly supervised semantic segmentation with a multi-image model, Proceedings of the IEEE International Conference on Computer Vision.
Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation, Proceedings of the IEEE International Conference on Computer Vision.
A probabilistic associative model for segmenting weakly supervised images, IEEE Trans. Image Process.
Survey of the problem of object detection in real images, Int. J. Image Process.
Learning a classification model for segmentation, Proceedings of the IEEE International Conference on Computer Vision.
Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell.
Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell.
TurboPixels: fast superpixels using geometric flows, IEEE Trans. Pattern Anal. Mach. Intell.
Hongshan Yu received the B.S., M.S. and Ph.D. degrees in Control Science and Technology from the College of Electrical and Information Engineering, Hunan University, Changsha, China, in 2001, 2004 and 2007, respectively. From 2011 to 2012, he worked as a postdoctoral researcher in the Laboratory for Computational Neuroscience at the University of Pittsburgh, USA. He is currently an associate professor at Hunan University and associate dean of the National Engineering Laboratory for Robot Visual Perception and Control. His research interests include autonomous mobile robots and machine vision.
Zhengeng Yang received the B.S. and M.S. degrees from Central South University, Changsha, China, in 2009 and 2012, respectively. He is currently a Ph.D. candidate at Hunan University, Changsha, China. His research interests include computer vision, image analysis and machine learning, with a particular focus on semantic segmentation.
Lei Tan received the B.S. and M.S. degrees in Control Science and Technology from the College of Electrical and Information Engineering, Hunan University, Changsha, China, in 2007 and 2010, respectively. From 2013 to 2015, he visited the Robotics Institute at Carnegie Mellon University, USA, as a joint-training Ph.D. student sponsored by the China Scholarship Council. His research interests mainly focus on computer vision and mobile robots.
Yaonan Wang received the B.S. degree in computer engineering from East China Technology Institute in 1981, and the M.S. and Ph.D. degrees in electrical engineering from Hunan University, Changsha, China in 1990 and 1994 respectively. From 1994 to 1995, he was a Postdoctoral Research Fellow with the National University of Defense Technology. He is currently a Professor at Hunan University. His research interests are image processing, pattern recognition, and robot control.
Wei Sun received the M.S. and Ph.D. degrees in Control Science and Technology from Hunan University, Changsha, China, in 1999 and 2002, respectively. He is currently a Professor at Hunan University. His research interests include artificial intelligence, robot control, complex mechanical and electrical control systems, and automotive electronics.
Mingui Sun received the B.S. degree in instrumental and industrial automation from Shenyang Chemical Engineering Institute, Shenyang, China, in 1982, and the M.S. and Ph.D. degrees in electrical engineering from the University of Pittsburgh, Pittsburgh, PA, USA, in 1986 and 1989, respectively. He is currently a Professor of neurosurgery, electrical and computer engineering, and bioengineering. His current research interests include advanced biomedical electronic devices, biomedical signal and image processing, sensors and transducers, artificial neural networks.
Yandong Tang received the B.S. and M.S. degrees in mathematics from Shandong University, China, in 1984 and 1987, respectively. He received the Ph.D. degree in applied mathematics from the University of Bremen, Germany, in 2002. He is a Professor at the Shenyang Institute of Automation, Chinese Academy of Sciences. His research interests include numerical computation, image processing, and computer vision.