1 Introduction

Semantic object segmentation in videos is a challenging task that enables a wide range of higher-level applications, such as robotic vision, object tracking, video retrieval, and scene understanding. Tremendous progress has been made lately on this problem by integrating higher-level semantic information and contextual cues [2, 5, 8, 13, 15, 16, 18, 23]. However, as in classical figure-ground video segmentation, fast motion, appearance variations, pose changes, and occlusions pose significant challenges to delineating semantic objects from video sequences. The difficulty of resolving the inherent semantic ambiguities further complicates the problem.

Recently, segmentation-by-detection and tracking approaches have been proposed to address this challenging problem. Early work in this direction trained classifiers to incorporate scene topology and semantics into pixel-level object detection and localization [15]. Later, both object detectors and trackers were employed to either impose spatio-temporal coherence [2, 23] or learn an appearance model [16] encoding the appearance variation of semantic objects. More recently, a hierarchical graphical model has been proposed to integrate longer-range object reasoning with superpixel labeling [18]. Despite the significant advances made by the above methods, global contextual relationships between semantic video objects remain under-explored. Yet contextual relationships are ubiquitous and provide important cues for scene-understanding tasks.

The importance of exploiting pairwise relationships between objects has been highlighted in semantic segmentation [6] and object detection [4], where the relationship is formulated in terms of the co-occurrence of higher-level object-class statistics. These methods tend to favor objects that appear frequently in the training data in order to enforce rigid semantic label agreement. Furthermore, such conventional context models are sensitive to the number of pixels or regions that objects occupy, with the consequence that small objects are more likely to be omitted.

In this work, we propose a novel graphical model to thoroughly exploit contextual relationships among semantic video objects without relying on training data. Modeling spatio-temporal object contextual relationships in this way has not been well studied. We present a novel nonparametric approach that captures the intra- and inter-category contextual relationships by considering the content of the input video. This nonparametric context model comprises a set of spatio-temporal context exemplars obtained by performing higher-level video analysis, i.e., object detection and tracking. These context exemplars offer a novel interpretation of contextual relationships as links, which formulates the learning of contextual relationships as a label propagation problem on a similarity graph. This similarity graph naturally reflects the intrinsic and extrinsic relationships between semantic objects in the spatio-temporal domain. Owing to the sparsity of this similarity graph, the learning process can be very efficient.

The key contributions of this work are as follows. Firstly, we establish a novel link prediction view of semantic contexts. In this view, the problem of learning semantic relationships is formulated as a graph-based label propagation problem. Secondly, our approach is an exemplar-based nonparametric model and therefore does not require additional training data to build an explicit context model. Hence, it is well suited to video semantic object segmentation, a domain where annotated data are scarce. The paper is organized as follows. We introduce the novel link prediction view of contexts in Sect. 2.3, utilizing the semantic contextual information from the object trajectory hypotheses of Sect. 2.1. The link prediction algorithm is described in Sect. 2.4, and the final semantic labeling in Sect. 2.5.

Fig. 1.

Illustration of the proposed approach. Two trajectory hypotheses are extracted to provide initial annotations for ‘horse’ and ‘person’ classes, which form the context exemplars such as (horse, person)-link between the corresponding vertices on the similarity graph. Our method propagates such contextual relationship on the graph and predicts the probability of (horse, person)-link between unlabeled vertices based on similarity.

2 The Approach

In this section, we describe our proposed exemplar-based nonparametric model and how the learned contextual relationships are integrated into semantic labeling in a principled manner.

2.1 Trajectory Hypotheses

For a given video sequence with T frames, we generate a set of object trajectory hypotheses with respect to semantic categories via object detection and temporal association. Such hypotheses characterize the long-range spatio-temporal evolution of various object features and are commonly tied to higher-level contexts such as object interactions and behaviours [7, 14, 17,18,19,20,21,22].

Specifically, we first extract generic object proposals by applying MCG [1] to each frame. Object detection is performed on this pool of proposals using faster R-CNN [11], trained on the 20 PASCAL VOC classes. A set of object hypotheses \(\mathbb {D}\) is formed by keeping proposals whose detection confidence exceeds a threshold (0.5).

Object trajectory hypotheses \(\mathbb {T}\) w.r.t. each semantic class are generated by temporally associating object hypotheses from \(\mathbb {D}\) under frame-to-frame spatio-temporal consistency, similar to [18]. Specifically, we use an object tracker [9] to track object hypotheses over time toward both ends of the video sequence as follows.

  • Initialize an empty trajectory hypothesis \(T_i \in \mathbb {T}\)

  • Rank remaining object hypotheses in \(\mathbb {D}\) based on detection confidence

  • Initialize the tracker with the bounding box of the highest-ranked object hypothesis and track in both directions simultaneously

  • Select the object hypothesis in each new frame that has sufficient overlap, i.e., Intersection-over-Union (IoU) above a threshold (0.5), with the tracker box; add it to \(T_i\) and remove it from \(\mathbb {D}\)

The above steps are performed iteratively until no new trajectory hypothesis containing three or more object instances can be generated from \(\mathbb {D}\). Figure 1 shows examples of object trajectory hypotheses extracted from a video sequence. Regions [3] are extracted from each frame as the atomic data units. Let \(\mathcal {R}_D\) be the set of regions constituting the video object hypotheses, and \(\mathcal {R}_U\) the unlabeled regions.
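The greedy generation procedure above can be sketched as follows. This is a simplified, forward-only illustration with detections represented as plain dictionaries and an IoU test standing in for the dedicated tracker [9]; the paper additionally tracks in both temporal directions.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def build_trajectories(detections, iou_thresh=0.5, min_len=3):
    """Greedily group per-frame detections into trajectory hypotheses.

    detections: list of dicts {"frame", "box", "score"}. Each iteration
    seeds a track with the highest-confidence remaining detection, then
    associates one detection per subsequent frame by overlap; tracks with
    fewer than min_len (three) instances are discarded, as in the paper.
    """
    pool = list(detections)
    trajectories = []
    while pool:
        seed = max(pool, key=lambda d: d["score"])  # highest confidence
        pool.remove(seed)
        track, last_box = [seed], seed["box"]
        # forward-only association; the paper tracks in both directions
        for f in sorted({d["frame"] for d in pool if d["frame"] > seed["frame"]}):
            cands = [d for d in pool
                     if d["frame"] == f and iou(last_box, d["box"]) >= iou_thresh]
            if not cands:
                break
            best = max(cands, key=lambda d: iou(last_box, d["box"]))
            track.append(best)
            pool.remove(best)
            last_box = best["box"]
        if len(track) >= min_len:
            trajectories.append(track)
    return trajectories
```

The loop terminates because every iteration removes at least the seed detection from the pool, mirroring the stopping criterion above.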

2.2 Graph Construction

We first construct a k-nearest-neighbor similarity graph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) over all N regions from \(\mathcal {R}_D \cup \mathcal {R}_U\). Each vertex \(v_{i} \in \mathcal {V}\) of the graph is described by the L2-normalized VGG-16 Net [12] fc6 features \(f_i\) of the corresponding region. The weight \(w_{i,j} \in \mathbf {W}\) of each edge \(e_{i,j} \in \mathcal {E}\) is defined as the inner product between the feature vectors of neighboring vertices, i.e., \(w_{i,j} = \langle f_i, f_j \rangle \).
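A minimal sketch of this graph construction, assuming region features are supplied as a NumPy matrix (the VGG-16 fc6 extraction itself is omitted):

```python
import numpy as np

def knn_similarity_graph(features, k=3):
    """Build a sparse k-NN similarity graph from region features.

    features: (N, d) array; the paper uses VGG-16 fc6 activations, but any
    feature matrix works here. Each row is L2-normalized, edge weights are
    inner products w_ij = <f_i, f_j>, and only the k nearest neighbours of
    each vertex are kept. The result is symmetrised so an edge survives if
    either endpoint selects it.
    """
    F = features / np.linalg.norm(features, axis=1, keepdims=True)
    S = F @ F.T                        # inner products of normalized features
    np.fill_diagonal(S, -np.inf)       # exclude self-loops from the k-NN choice
    W = np.zeros_like(S)
    nn = np.argsort(-S, axis=1)[:, :k] # indices of the k most similar vertices
    rows = np.repeat(np.arange(len(F)), k)
    W[rows, nn.ravel()] = S[rows, nn.ravel()]
    return np.maximum(W, W.T)          # symmetrise
```

The sparsity of \(\mathbf {W}\) (at most on the order of kN nonzero entries) is what makes the later propagation efficient.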

2.3 Context Modeling

Frames containing object hypotheses are treated as annotated data for generating context exemplars, since object trajectory hypotheses normally capture the essential parts of video objects. Let \(\mathcal {F}\) be this set of annotated frames and \(\mathcal {\hat{F}}\) the remaining frames of the current video sequence. A context exemplar consists of a pair of regions and their corresponding semantic labels. The intuition behind this setting is that one region, together with its semantic label, supports labeling the paired region with its corresponding semantic label. Such exemplars can encode the global interaction and co-occurrence of semantic objects beyond local spatial adjacency. The goal is to impose consistency between each pair of regions from un-annotated frames and the extracted context exemplars.

Formally, given a set of semantic labels \(\mathcal {C} = \{c_0, c_1, \dots , c_{L-1}\}\) comprising all classes in the annotated data, we represent the context exemplars for each class pair \((c_m, c_n)\) as

$$\begin{aligned} \mathbf {A^{m,n}} = \{(v_i, v_j): C(v_i) = c_m, C(v_j) = c_n, v_i, v_j \in \mathcal {F} \} \end{aligned}$$

where \(v_i, v_j \in \mathcal {F}\) stands for two regions \(v_i\) and \(v_j\) from the annotated frame set \(\mathcal {F}\) and \(C(v_i)\) represents the semantic label of region \(v_i\). Hence, all object class pairs as well as contextual relationships in the annotated frames are represented as \(\mathcal {A} = \{\mathbf {A}^{0,0}, \mathbf {A}^{0,1}, \dots , \mathbf {A}^{L-1,L-1}\}\).

We transform the above context exemplars into a context link view of contextual knowledge, where a context exemplar \((v_i, v_j)\) is referred to as a \((c_m, c_n)\)-type link between two vertices on the similarity graph. Let \(\mathcal {P}\) denote the set of \(N\times N\) matrices, where a matrix \(\mathbf {P}^{m,n}\in \mathcal {P}\) is associated with all \((c_m, c_n)\) class pair links. Each entry \([\mathbf {P}^{m,n}]_{i,j}\) indicates the confidence of a \((c_m, c_n)\)-link between regions \(v_i\) and \(v_j\). The confidence, ranging between 0 and 1, corresponds to the probability of the existence of a link: 1 stands for high confidence in the existence of a link and 0 indicates its absence. The \((c_m, c_n)\)-links observed within the annotated frames are represented by another set of matrices \(\mathbf {O}^{m,n}\in \mathcal {P}\) such that

$$\begin{aligned}{}[\mathbf {O}^{m,n}]_{i,j} = \left\{ \begin{array}{ll} 1 &{} \text {if } (v_i, v_j) \in \mathbf {A}^{m,n} \\ 0 &{} \text {otherwise} \end{array}\right. \end{aligned}$$
(1)

All the observed context links can be denoted as \(\mathcal {O} = \{\mathbf {O}^{0,0}, \mathbf {O}^{0,1}, \dots , \mathbf {O}^{L-1,L-1}\}\).
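As an illustration, the observed link matrices of Eq. (1) could be assembled from the region labels as follows; the dictionary-based representation of labels and of \(\mathcal {O}\) is our own simplification:

```python
import numpy as np

def observed_link_matrices(labels, N):
    """Build the observed link matrices O^{m,n} of Eq. (1).

    labels: dict mapping region index -> class index, covering only the
    regions from annotated frames (unlabeled regions are simply absent).
    N: total number of regions (graph vertices). Returns a dict keyed by
    class pair (m, n), where O[(m, n)][i, j] = 1 iff the exemplar
    (v_i, v_j) carries classes (c_m, c_n).
    """
    O = {}
    ann = list(labels.items())
    for i, m in ann:
        for j, n in ann:
            if i == j:
                continue  # a region does not form an exemplar with itself
            mat = O.setdefault((m, n), np.zeros((N, N)))
            mat[i, j] = 1.0
    return O
```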

2.4 Context Prediction

Given the above context link view of contextual knowledge, we formulate context prediction as a link prediction task, which determines how probable it is that a certain link exists in a graph. To this end, we predict \((c_m, c_n)\)-links among pairs of vertices from \(\mathcal {R}_U\) based on \(\mathbf {O}^{m,n}\), consistent with the intrinsic structure of the similarity graph. Specifically, we propagate the \((c_m, c_n)\)-links in \(\mathbf {O}^{m,n}\) to estimate the link strength between pairs of vertices from \(\mathcal {R}_U\). We drop the m, n superscripts for clarity.

Algorithm 1

Directly solving the link prediction problem is impractical for video segmentation, since the complexity is as high as \(O(N^4)\). Hence we propose to decompose the link propagation problem into two separate label propagation processes. As described in Algorithm 1, row-wise link prediction (steps 13–14) is performed first, followed by column-wise link prediction (steps 16–17). More specifically, the j-th row \(\mathbf {O}^{j,.}\), i.e., the context exemplars associated with \(v_j\), serves as the initial configuration of a label propagation problem [24] with respect to vertex \(v_j\). Each row is handled separately as a binary label propagation, which converges to \(\mathbf {\hat{P}}_r\). Observe that label propagation does not apply to the rows of \(\mathbf {O}\) corresponding to \(\mathcal {R}_U\); we therefore perform row-wise link propagation only on the rows corresponding to annotated regions, which are far fewer than N. For the column-wise propagation, the i-th converged row \([\mathbf {\hat{P}}_r]_i\) is used to initialize the configuration. After the column-wise propagation converges, the probability of a \((c_m, c_n)\)-link between any two vertices of \(\mathcal {R}_U\) is obtained.
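The two-stage decomposition can be sketched compactly using the closed-form solution of the label propagation scheme of Zhou et al. [24], \(F^* = (1-\alpha)(I-\alpha S)^{-1}Y\) with \(S = D^{-1/2} W D^{-1/2}\). The graph, initialization, and parameter names below are illustrative, and the dense solve stands in for the sparse iterative updates a real implementation would use:

```python
import numpy as np

def propagate(W, Y, alpha=0.9):
    """Label propagation of Zhou et al. [24] in closed form:
    F* = (1 - alpha) (I - alpha S)^{-1} Y,  S = D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    d[d == 0] = 1.0                    # guard isolated vertices
    Dinv = np.diag(1.0 / np.sqrt(d))
    S = Dinv @ W @ Dinv
    N = len(W)
    return (1 - alpha) * np.linalg.solve(np.eye(N) - alpha * S, Y)

def predict_links(W, O, labeled_rows, alpha=0.9):
    """Two-stage link propagation sketched from Algorithm 1.

    Row-wise: each row of O belonging to an annotated region (indexed by
    labeled_rows, far fewer than N) is propagated over the graph; then
    the converged rows are propagated column-wise, yielding link
    probabilities between pairs of unlabeled vertices.
    """
    P = np.zeros_like(O, dtype=float)
    P[labeled_rows] = O[labeled_rows]
    P_r = propagate(W, P.T, alpha).T   # row-wise: each row is a label vector
    return propagate(W, P_r, alpha)    # column-wise pass
```

Decomposing the \(O(N^4)\) joint problem into these two passes reduces the work to a small number of standard label propagations over the same graph.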

2.5 Inference

We formulate semantic video object segmentation as a region labeling problem, where the learned context link scores \(S (v_i, c_i, v_j, c_j)\) are incorporated while assigning labels to the set of regions \(\mathcal {R}_D \cup \mathcal {R}_U\). We adopt a fully connected CRF, which has proved effective in modeling contextual relationships between object classes.

Consider a random field \(\mathbf {x}\) defined over a set of variables \(\{x_0, \dots , x_{N-1}\}\), and the domain of each variable is a set of class labels \(\mathcal {C} = \{c_0, c_1, \dots , c_{L-1}\}\). The corresponding Gibbs energy is

$$\begin{aligned} E(\mathbf {x}) = \sum _{i} \psi (x_i) + \sum _{i,j} \phi (x_i, x_j). \end{aligned}$$
(2)

The unary potential \(\psi (x_i)\) is defined as the negative logarithm of the likelihood of assigning label \(x_i\) to \(v_i\). To obtain \(\psi (x_i)\), we learn an SVM model on hierarchical CNN features [9] by sampling from the annotated frames.

The pairwise potential \(\phi (x_i, x_j)\) encodes the contextual relationships between the regions learned via link prediction, which is defined as

$$\begin{aligned} \phi (x_i, x_j) = \exp (-\frac{S (v_i, c_i, v_j, c_j)^2}{2\beta }) \end{aligned}$$
(3)

where \(\beta = \langle S (v_i, c_i, v_j, c_j)^2 \rangle \) is the adaptive weight and \(\langle \cdot \rangle \) denotes the expectation.
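Given a matrix of link scores, the pairwise potential of Eq. (3) with its adaptive weight can be computed directly; this small sketch assumes the scores are already collected into a NumPy array:

```python
import numpy as np

def pairwise_potential(S_scores):
    """Contextual pairwise potential of Eq. (3).

    S_scores: array of link scores S(v_i, c_i, v_j, c_j). The adaptive
    weight beta is the mean of the squared scores, i.e. beta = <S^2>,
    and phi = exp(-S^2 / (2 * beta)).
    """
    S2 = S_scores ** 2
    beta = S2.mean()                  # adaptive weight <S^2>
    return np.exp(-S2 / (2.0 * beta))
```

Normalizing by the expectation of the squared scores makes the potential scale-free, so the same formula behaves consistently across videos with different score ranges.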

We adopt a combined QPBO and \(\alpha \)-expansion inference (a.k.a. fusion moves) [?] to optimize (2); the resulting label assignment gives the semantic object segmentation of the video sequence.
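To make the energy of Eq. (2) concrete, here is a toy evaluation together with a brute-force minimizer; the brute-force search is only a stand-in for the fusion-moves inference, which scales to realistic region counts:

```python
from itertools import product

def gibbs_energy(labels, unary, pairwise):
    """Evaluate Eq. (2): E(x) = sum_i psi_i(x_i) + sum_{i<j} phi(x_i, x_j).

    unary: per-region table unary[i][l] = psi value of label l at region i;
    pairwise: callable (i, x_i, j, x_j) -> phi value for that pair.
    """
    N = len(labels)
    E = sum(unary[i][labels[i]] for i in range(N))
    E += sum(pairwise(i, labels[i], j, labels[j])
             for i in range(N) for j in range(i + 1, N))
    return E

def exhaustive_map(unary, pairwise, L):
    """Exact minimizer of Eq. (2) by enumerating all L^N labelings.

    A toy stand-in for the QPBO / alpha-expansion (fusion-moves)
    inference used in the paper; tractable only for tiny N.
    """
    N = len(unary)
    return min(product(range(L), repeat=N),
               key=lambda x: gibbs_energy(x, unary, pairwise))
```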

3 Experiments

We evaluate our proposed approach on YouTube-Objects [10], which is the de facto benchmark for assessing semantic video object segmentation algorithms. The class labels of this densely labeled dataset belong to the 20 classes of PASCAL VOC 2012. The YouTube-Objects dataset consists of videos from 10 classes with pixel-level ground truth for more than 20,000 frames in total. These videos are very challenging and completely unconstrained, with objects of similar colour to the background, fast motion, non-rigid deformations, and fast camera motion. We compare our approach with six state-of-the-art semantic video object segmentation methods that have reported results on this dataset, i.e., [10] (ODW), [13] (DSA), [23] (SOD), [16] (SDA), [2] (DTS), and [18] (CGG). The standard average IoU is used to measure segmentation accuracy, \(IoU = \frac{|S \cap G|}{|S \cup G|}\), where S is the segmentation result and G the ground-truth mask.
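The metric can be computed directly on binary masks:

```python
import numpy as np

def mask_iou(seg, gt):
    """Standard Intersection-over-Union between a binary segmentation
    mask and the ground-truth mask, |S ∩ G| / |S ∪ G|."""
    seg, gt = seg.astype(bool), gt.astype(bool)
    union = np.logical_or(seg, gt).sum()
    # two empty masks agree perfectly by convention
    return np.logical_and(seg, gt).sum() / union if union else 1.0
```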

Table 1. Intersection-over-union overlap on YouTube-Objects Dataset
Fig. 2.

Qualitative results of our algorithm on YouTube-Objects Dataset.

We summarize the comparisons of our algorithm with the other approaches in Table 1. Table 1 demonstrates the superior performance of our proposed algorithm, which surpasses the competing methods in all classes, with a significant increase in segmentation accuracy, i.e., \(3.2\%\) on average over the best competing method, CGG. We attribute this improvement to the capability of learning and propagating higher-level spatio-temporal contextual relationships of video objects, as opposed to imposing contextual information in local labeling (CGG) or modeling local appearance (SDA). One common limitation of these methods is that they are error-prone in separating interacting objects of similar appearance or motion, which is intractable unless the inherent contextual relationships are explored.

Our algorithm outperforms the other two methods that also utilize object detection, i.e., DTS and SOD, by large margins of \(11.3\%\) and \(13.4\%\), respectively. DTS shares some similarity with our approach in that it also uses faster R-CNN for initial object detection, which makes it a natural baseline for demonstrating the effectiveness of our algorithm. By exploiting contextual relationships in a global manner, our algorithm is able to account for object evolution in the video data and thereby resolve both appearance and motion ambiguities. SOD performs the worst among the three, as it only conducts temporal association of detected object segments without explicitly modeling either the objects or their contexts. Qualitative results of the proposed algorithm on the YouTube-Objects dataset are shown in Fig. 2.

4 Conclusion

We have proposed a novel approach to modeling semantic contextual relationships for tackling the challenging video object segmentation problem. The proposed model comprises an exemplar-based nonparametric view of contextual cues, formulated as a link prediction problem and solved by label propagation on a similarity graph of regions. The derived contextual relationships are utilized to estimate the pairwise contexts between all pairs of unlabeled local regions. The experiments demonstrate that modeling semantic contextual relationships effectively improves segmentation robustness and accuracy, significantly advancing the state of the art on a challenging benchmark.