Full length article
Video object segmentation via random walks on two-frame graphs comprising superpixels

https://doi.org/10.1016/j.jvcir.2021.103293

Highlights

  • The proposed method is superior to several recent top-performing algorithms.

  • Random walks are employed on the graph constructed on two consecutive frames.

  • A strategy for adjusting the superpixel number of the method is designed.

Abstract

We propose a novel video object segmentation method in which random walkers travel on graphs constructed on two consecutive frames. First, we estimate the initial foreground and background distributions by minimising an energy function that incorporates the stationary distributions of the random walks. The random walkers frequently travel between similar nodes of the graph constructed on two adjacent frames, which incorporates the inter-frame information into the energy function effectively and elegantly. Then, we refine the initial results by simulating the movements of multiple random walkers. We process the sequence recursively, which naturally propagates the previous segmentation labels to the subsequent frames. Additionally, we develop a strategy for adjusting the superpixel number using region similarity and the average Frobenius norm of the optical flow gradient. This strategy can improve performance significantly. Furthermore, we discuss the feature selection problem in the method to select a more effective feature representation. Extensive comparative experiments on SegTrack and SegTrack v2 demonstrate that the proposed algorithm yields higher performance than several recent state-of-the-art approaches.

Introduction

Video object segmentation [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], an important problem in computer vision, is attracting increasing research attention. It focuses on the object of interest rather than on all the objects in the video. The segmentation result can be a single object or multiple objects. The object to be segmented is called the foreground, and everything else is called the background. Video object segmentation is important for various applications, including object tracking, activity recognition, visual enhancement and content retrieval. According to the level of supervision required, video object segmentation approaches can be classified into unsupervised, semi-supervised, and supervised approaches. Unsupervised approaches automatically segment the primary object without manual annotation. Typically, they assume that the motion of the object to be segmented differs from that of its surroundings. Semi-supervised approaches require manual annotations in the first frame and use this labelling to segment and track objects in subsequent frames. Supervised approaches are used for particular scenarios and require a human to correct the algorithm results repeatedly during the segmentation process. The proposed approach is semi-supervised. Video segmentation approaches can also be categorised by other criteria, which are discussed in Section 2.

Superpixel generation is an important preprocessing stage in many computer vision applications. Superpixels can serve as mid-level features while improving the computational efficiency of an algorithm, so they have been introduced to video object segmentation. SLIC [12] is a popular superpixel segmentation approach that generates superpixels by iteratively applying simple K-means clustering, exploiting colour and coordinate information simultaneously. Many improved superpixel segmentation algorithms [13], [14], [15], [16] have been proposed. Shen et al. [16] proposed a fast image superpixel segmentation algorithm. They introduced the density-based spatial clustering of applications with noise (DBSCAN) algorithm to superpixel generation because DBSCAN helps to segment complex and irregularly shaped objects. Moreover, random walk based algorithms [17], [18], [19] were introduced for image segmentation and superpixel segmentation because the random walk algorithm is straightforward to implement and performs efficiently.
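
To make concrete how SLIC uses colour and coordinate information together, the following minimal sketch computes the combined distance that the SLIC paper [12] uses when assigning a pixel to a cluster centre: a LAB colour distance plus a spatial distance normalised by the grid interval and weighted by a compactness parameter. The function names, the tuple layout and the default values are illustrative assumptions, and the restriction of the search to a local window around each centre is omitted for brevity.

```python
import math


def slic_distance(pixel, centre, grid_interval, compactness):
    """Combined colour/spatial distance between a pixel and a cluster centre.

    Both arguments are (l, a, b, x, y) tuples (an assumed layout).
    """
    l1, a1, b1, x1, y1 = pixel
    l2, a2, b2, x2, y2 = centre
    d_colour = math.sqrt((l1 - l2) ** 2 + (a1 - a2) ** 2 + (b1 - b2) ** 2)
    d_space = math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
    # A larger compactness gives more weight to spatial proximity.
    return math.sqrt(d_colour ** 2 + (d_space / grid_interval) ** 2 * compactness ** 2)


def assign_to_nearest(pixel, centres, grid_interval=10.0, compactness=10.0):
    """Return the index of the centre with the smallest combined distance."""
    dists = [slic_distance(pixel, c, grid_interval, compactness) for c in centres]
    return dists.index(min(dists))
```

With a small compactness the assignment is driven mainly by colour, whereas a large compactness yields more regular, grid-like superpixels.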

Recently, researchers extended the random walk algorithm to video object segmentation. Faktor and Irani proposed a robust video object segmentation approach by non-local consensus voting [1]. It performs effectively even in the case of low resolution, highly non-rigid motion, large scale and illumination changes. They first generated the superpixels for each frame. The superpixels of a certain number of frames constituted a set. After defining the neighbours of a superpixel, they connected the superpixel with its neighbours. Then, the similarity matrix called the random-walk transition matrix was constructed for the superpixels in the set. The probability distribution of the superpixels in the set attained a stationary distribution after a sufficient number of probability transitions. The segmentation results of these frames can be computed using this stationary distribution. Jang and Kim [20] proposed a semi-supervised video object segmentation approach. They also over-segmented each frame into superpixels. They designed an initial segmentation scheme that minimises an energy function. The energy function contains two Markov energy terms. The two Markov terms compel the probabilities to be distributed based on the stationary distribution of a random walk. Then, they refined the initial results using multiple random walkers. They exploited the stationary distribution of a random walk to assign an identical segmentation label to the superpixels with similar features.
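
As a rough illustration of the idea that a probability distribution over superpixels settles into a stationary distribution after enough transitions, the sketch below row-normalises a small affinity matrix into a random-walk transition matrix and applies power iteration. It is a generic sketch, not the formulation of [1] or [20]; the function names and the example matrix are assumptions, and it presumes that every node has at least one positive edge weight and that the chain is irreducible and aperiodic.

```python
def row_normalise(affinity):
    """Turn a non-negative affinity matrix into a row-stochastic transition matrix.

    Assumes every row has a positive sum.
    """
    transition = []
    for row in affinity:
        total = sum(row)
        transition.append([w / total for w in row])
    return transition


def stationary_distribution(transition, tol=1e-12, max_iter=10000):
    """Approximate the stationary distribution by repeated transitions."""
    n = len(transition)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of the walk: p_next[j] = sum_i p[i] * transition[i][j]
        p_next = [sum(p[i] * transition[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(p, p_next)) < tol:
            return p_next
        p = p_next
    return p


# Hypothetical 3-node graph with self-loops; the weights are symmetric.
affinity = [[1.0, 2.0, 0.0],
            [2.0, 1.0, 1.0],
            [0.0, 1.0, 1.0]]
pi = stationary_distribution(row_normalise(affinity))
```

For a symmetric affinity matrix such as this one, the stationary probability of a node is proportional to the sum of its edge weights, so strongly connected nodes receive more probability mass.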

However, in the initial segmentation, Jang and Kim [20] constructed the graph on a single frame, which cannot effectively exploit the inter-frame consistency. The method proposed by Jang and Kim uses the random walk formulation twice. First, the stationary distributions of the random walks are incorporated into the energy function as two Markov terms, and the energy function is formulated to perform the initial segmentation. Second, the method simulates the transitions of multiple random walkers to refine the initial segmentation results. To avoid ambiguity, we stress that our discussion concerns the formulation of random walks in the initial segmentation. For convenience, we refer to the method proposed by Jang and Kim [20] as the baseline method throughout the remainder of the paper.

Second, the baseline method's practice of setting an identical number of superpixels for all the video sequences needs improvement. Superpixels are commonly used in numerous computer vision algorithms, such as object localisation [21], multi-target tracking [22], [23], and video object segmentation [1], [8], [10], [20], [24], [25]. They are convenient primitives and are effective for capturing image redundancy. Because the number of superpixels is smaller than the number of pixels, the use of superpixels significantly reduces the computational complexity. The number of superpixels in these algorithms is a hyperparameter [1], [10], [20], [24], and the algorithms cannot adapt it by themselves. The number of superpixels in the baseline method is also a hyperparameter, set to an identical value for all the video sequences. After conducting several experiments, we observe that the number of superpixels at which the baseline method achieves near-optimal performance varies across the video sequences.

Finally, we discuss the feature selection problem in the baseline method. Video object segmentation algorithms typically utilise motion and appearance information to complete the segmentation task. Colour descriptors and motion vectors are common representations of appearance and motion information, respectively. Specifically, the baseline method uses the average LAB colour and the average optical flow of the superpixel to represent appearance and motion features, respectively. There are several alternative representations of colour, and motion features can also be represented using histograms of oriented optical flow (HOOF). Therefore, we can change the original feature representation of the algorithm. For example, the original colour feature can be replaced with the median LAB colour or maximum LAB colour of the superpixel, or with its average RGB colour, LAB colour histogram, or RGB colour histogram. Similarly, the motion feature can be expressed as the minimum optical flow, maximum optical flow, HOOF, etc., of the superpixel. Khoreva et al. discussed the issue of superpixel feature selection based on a graph-based method [26]. They considered 14 appearance- or motion-based features from state-of-the-art video object segmentation methods [2], [4], [5], [6], [27], [28], [29]. They concluded that the median LAB colour and the median optical flow are two of the most contributive features. We now address the following question: can substituting alternative features improve the performance of the baseline method?
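
The alternative representations listed above differ mainly in how per-pixel values are aggregated within a superpixel. The following sketch, written under the assumption that pixels are given as a flat list of superpixel labels and a matching list of per-pixel vectors (for example LAB colours or optical flow vectors), shows how the average, median, minimum, or maximum feature can be obtained by swapping a single reducer function. The names and the data layout are illustrative and are not taken from the baseline implementation.

```python
import statistics


def superpixel_features(labels, values, reducer=statistics.mean):
    """Aggregate per-pixel vectors into one feature vector per superpixel.

    labels:  superpixel id of each pixel (flat list).
    values:  per-pixel vectors, e.g. (L, A, B) colours or (u, v) flow.
    reducer: statistics.mean, statistics.median, min, or max.
    """
    grouped = {}
    for label, vec in zip(labels, values):
        grouped.setdefault(label, []).append(vec)
    # zip(*vecs) regroups the pixel vectors channel by channel.
    return {
        label: tuple(reducer(channel) for channel in zip(*vecs))
        for label, vecs in grouped.items()
    }


# Hypothetical toy data: five pixels, two superpixels, two channels.
labels = [0, 0, 0, 1, 1]
values = [(10.0, 1.0), (20.0, 2.0), (60.0, 3.0), (5.0, 5.0), (7.0, 9.0)]
mean_feat = superpixel_features(labels, values)
median_feat = superpixel_features(labels, values, reducer=statistics.median)
```

The median is less sensitive than the average to a few outlying pixels inside a superpixel, which is one reason the choice of reducer may affect segmentation quality.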

We address the three aforementioned problems in turn. First, we extend the graph onto two adjacent frames, which naturally introduces constraints of spatiotemporal consistency. Whereas the graph in the baseline method [20] is established on one frame, our graph is constructed on two frames. Our graph stimulates the probability transfer between similar superpixels that belong to the same object and are located in the two frames. That is, two superpixels located in different frames can be connected when their attributes are similar.
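
One possible way to picture such a two-frame graph is sketched below. The nodes are the superpixels of both frames, and two nodes are connected, regardless of which frame they come from, when their feature vectors are close. The Gaussian edge weight, the distance threshold `max_dist` and the bandwidth `sigma` are assumptions made for illustration only; the actual edge definition of the proposed method is given in Section 3.1 of the paper and may differ.

```python
import math


def two_frame_affinity(features_prev, features_curr, sigma=10.0, max_dist=30.0):
    """Affinity matrix over the superpixels of two consecutive frames.

    Rows and columns with index < len(features_prev) belong to the previous
    frame; the remaining ones belong to the current frame.
    """
    features = list(features_prev) + list(features_curr)
    n = len(features)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist = math.dist(features[i], features[j])
            # Connect only nodes with similar attributes (assumed rule).
            if dist <= max_dist:
                affinity[i][j] = math.exp(-(dist ** 2) / (sigma ** 2))
    return affinity


# Hypothetical mean LAB colours of two superpixels per frame.
prev = [(50.0, 0.0, 0.0), (90.0, 10.0, 10.0)]
curr = [(52.0, 1.0, 0.0), (10.0, 0.0, 0.0)]
W = two_frame_affinity(prev, curr)
```

Dividing each row of the resulting matrix by its sum would give a transition matrix in which a walker on a superpixel of one frame can step directly to a similar superpixel of the other frame.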

Second, we design a strategy for adjusting the superpixel number. By incorporating this strategy, the algorithm can, to a certain extent, set the number of superpixels adaptively for different video sequences without manual tuning. Specifically, the adjustment strategy consists of two parts: relative score calculation and a search strategy. Given the video sequence and a candidate superpixel number, the algorithm is executed to obtain the segmentation results of the first three frames of the sequence. We then calculate the relative score by applying the measurement formulas to these segmentation results. The measurement formulas are derived from the region similarity and the average Frobenius norm of the optical flow gradient around the boundary of the object. The search strategy specifies the value range, interval, search direction, starting value and stopping criterion for the number of superpixels. A detailed description of the relative score calculation and the search strategy is provided in Section 3.2.
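
The two quantities underlying the relative score can be sketched as follows. The code assumes that the optical flow is stored as two 2D lists `u` and `v`, that the flow gradient at a pixel is the 2x2 matrix of finite differences of `u` and `v`, and that region similarity is measured as intersection over union, as is common in segmentation benchmarks. These are assumptions for illustration; the actual measurement formulas and the way they are combined into a score are defined in Section 3.2 of the paper and are not reproduced here.

```python
import math


def _diff(grid, y, x, dy, dx):
    """Finite difference along (dy, dx): central inside, one-sided at borders."""
    h, w = len(grid), len(grid[0])
    y0, x0 = max(y - dy, 0), max(x - dx, 0)
    y1, x1 = min(y + dy, h - 1), min(x + dx, w - 1)
    span = (y1 - y0) + (x1 - x0)
    return (grid[y1][x1] - grid[y0][x0]) / span if span else 0.0


def flow_gradient_frobenius(u, v, y, x):
    """Frobenius norm of the flow gradient [[u_x, u_y], [v_x, v_y]] at (y, x)."""
    terms = [_diff(u, y, x, 0, 1), _diff(u, y, x, 1, 0),
             _diff(v, y, x, 0, 1), _diff(v, y, x, 1, 0)]
    return math.sqrt(sum(t * t for t in terms))


def mean_boundary_norm(u, v, boundary_pixels):
    """Average Frobenius norm over a non-empty list of (y, x) boundary pixels."""
    norms = [flow_gradient_frobenius(u, v, y, x) for y, x in boundary_pixels]
    return sum(norms) / len(norms)


def region_similarity(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (y, x) pixels."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 1.0
```

Intuitively, a segmentation boundary that coincides with a true motion boundary should pass through pixels where the flow changes sharply, so a larger average norm along the boundary suggests a better-placed contour.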

Finally, we conduct several comparative experiments to select a more effective feature representation. We first use the median values of the LAB colour and the optical flow of the superpixel as the appearance and motion features, respectively. We then replace the median with the minimum, the average, and the maximum in turn and repeat the experiments. Moreover, the appearance feature can be represented using the RGB colour, LAB colour histogram, or RGB colour histogram, and HOOF can be adopted to represent the motion feature. We conduct extensive experiments using different combinations of colour and motion features. We observe that using the average LAB colour and the average optical flow of the superpixel allows the algorithm to achieve higher performance under its current framework.

In summary, there are three contributions of this study:

  • 1. We extend the underlying graph used by the baseline method in the initial segmentation onto two consecutive frames. Compared with the graph established on one frame, our graph stimulates the probability transfer between similar superpixels that belong to the same object and are located in the two frames, which introduces the inter-frame information effectively and elegantly.

  • 2. We design a strategy for adjusting the superpixel number by computing the region similarity and the average Frobenius norm of the optical flow gradient. Experiments demonstrate that incorporating this strategy can determine a near-optimal number of superpixels, which significantly improves the algorithm performance.

  • 3. We discuss the feature selection problem and conclude that the method using the average LAB colour and the average optical flow of the superpixel can achieve higher performance under the framework of this algorithm.

This paper is organised as follows: Related work in the literature is described in Section 2. The proposed method is covered in Section 3. Section 3.1 describes the segmentation step of the algorithm, including the construction of the two-frame graph and the design of the energy function. The energy function is utilised to perform the initial segmentation. Then, the refinement of the initial segmentation results by simulating the transitions of multiple random walkers is also described in this section. Section 3.2 presents our strategy for adjusting the superpixel number using region similarity and the average Frobenius norm of optical flow gradient. Section 3.3 briefly discusses the feature selection problem. The quantitative and qualitative results of the experiments are presented in Section 4. Finally, we summarise this work in Section 5.

Related work

In this section we review several categories of video object segmentation methods.

Occlusion-based Methods. Occlusion relations imply the grouping of the image domain into ‘objects’, which have been used for partitioning of video frames [30], [31], [32], [33], [34], [35]. Object or viewer motion causes occlusion relations to occur. Depth layers can be inferred from occlusion relations. Layer A is, therefore, closer to the viewer than Layer B if the image region in Layer B is occupied by the

Proposed method

Firstly, we describe the segmentation step of the algorithm (Section 3.1). Then, the strategy for adjusting the superpixel number by considering region similarity and the average Frobenius norm of optical flow gradient is presented in Section 3.2. Finally, we discuss the feature selection problem briefly (Section 3.3).

Our video object segmentation method is semi-supervised. The input is a pixel-level object annotation of the first frame in addition to all the video frames. We obtain object

Experiments

In this section, we illustrate the performance of the proposed method. First, the datasets and the evaluation methodology corresponding to them are briefly described. Second, the overall comparison of the proposed approach with several recent state-of-the-art algorithms is presented. Third, we compare the proposed method with the baseline progressively. Finally, the results and analysis of the different combinations of colour and motion features are stated.

Conclusions

We proposed a semi-supervised video object segmentation method. The proposed method first achieves initial segmentation by exploiting stationary distributions of the random walks on graphs constructed on two consecutive frames. Transitions on the graph constructed on two adjacent frames can effectively utilise inter-frame consistency by stimulating probability transfer between similar nodes of the same object located in the two frames. Then, the initial results are refined by using multiple

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research is partially supported by the Beijing Natural Science Foundation, China (No. 4212025), National Natural Science Foundation of China (No. 61876018, No. 61976017).

References (90)

  • Y.J. Koh, C.S. Kim, Primary object segmentation in videos based on region augmentation and reduction, in: Proc. IEEE...
  • W. Wang et al., Saliency-aware video object segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
  • X. Chen, Z. Li, Y. Yuan, G. Yu, J. Shen, D. Qi, State-aware tracker for real-time video object segmentation, in: Proc....
  • R. Achanta et al., SLIC superpixels compared to state-of-the-art superpixel methods, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • J. Shen et al., Interactive segmentation using constrained Laplacian optimization, IEEE Trans. Circuits Syst. Video Technol. (2014)
  • X. Dong et al., Hierarchical superpixel-to-pixel dense matching, IEEE Trans. Circuits Syst. Video Technol. (2017)
  • Z. Li, J. Chen, Superpixel segmentation using linear spectral clustering, in: Proc. IEEE Conf. Computer Vision and...
  • J. Shen et al., Real-time superpixel segmentation by DBSCAN clustering algorithm, IEEE Trans. Image Process. (2016)
  • J. Shen et al., Lazy random walks for superpixel segmentation, IEEE Trans. Image Process. (2014)
  • X. Dong et al., Sub-Markov random walk for image segmentation, IEEE Trans. Image Process. (2016)
  • Y. Liang et al., Video supervoxels using partially absorbing random walks, IEEE Trans. Circuits Syst. Video Technol. (2016)
  • W.-D. Jang, C.-S. Kim, Semi-supervised video object segmentation using multiple random walkers, in: Proc. British...
  • B. Fulkerson, A. Vedaldi, S. Soatto, Class segmentation and object localization with superpixel neighborhoods, in:...
  • L. Liu, J. Xing, H. Ai, S. Lao, Semantic superpixel based vehicle tracking, in: Proc. IEEE Conf. Pattern Recognition...
  • A. Milan, L. Leal-Taixé, K. Schindler, I. Reid, Joint tracking and segmentation of multiple targets, in: Proc. IEEE...
  • L. Wen, D. Du, Z. Lei, S.Z. Li, M.-H. Yang, Jots: Joint online tracking and segmentation, in: Proc. IEEE Conf. Computer...
  • Y.-H. Tsai, M.-H. Yang, M.J. Black, Video segmentation via object flow, in: Proc. IEEE Conf. Computer Vision and...
  • A. Khoreva, F. Galasso, M. Hein, B. Schiele, Classifier based graph construction for video segmentation, in: Proc. IEEE...
  • D. Hoiem et al., Recovering surface layout from an image, Int. J. Comput. Vis. (2007)
  • P. Ochs et al., Segmentation of moving objects by long term video analysis, IEEE Trans. Pattern Anal. Mach. Intell. (2014)
  • P. Arbelaez et al., Contour detection and hierarchical image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (2011)
  • L. Bergen, F. Meyer, Motion segmentation and depth ordering based on morphological segmentation, in: Proc. European...
  • G.J. Brostow, I.A. Essa, Motion based decompositing of video, in: Proc. IEEE Int. Conf. Computer Vision (ICCV), vol. 1,...
  • A. Ayvaci et al., Detachable object detection: Segmentation and depth ordering from short-baseline video, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • Y. Yang, G. Sundaramoorthi, Modeling self-occlusions in dynamic shape and appearance tracking, in: Proc. IEEE Int....
  • B. Taylor, V. Karasev, S. Soatto, Causal video object segmentation from persistence of occlusions, in: Proc. IEEE Conf....
  • C. Zach, D. Gallup, J.-M. Frahm, Fast gain-adaptive KLT tracking on the GPU, in: Workshops of IEEE Conference on...
  • S.N. Sinha et al., Feature tracking and matching in video using programmable graphics hardware, Mach. Vis. Appl. (2011)
  • N. Sundaram, T. Brox, K. Keutzer, Dense point trajectories by GPU-accelerated large displacement optical flow, in:...
  • K. Fragkiadaki, G. Zhang, J. Shi, Video segmentation by tracing discontinuities in a trajectory embedding, in: Proc....
  • M. Keuper, B. Andres, T. Brox, Motion trajectory segmentation via minimum cost multicuts, in: Proc. IEEE Int. Conf....
  • L. Chen et al., Video object segmentation via dense trajectories, IEEE Trans. Multimedia (2015)
  • J. Shen et al., Submodular trajectories for better motion segmentation in videos, IEEE Trans. Image Process. (2018)
  • Y. Liu et al., Better dense trajectories by motion in videos, IEEE Trans. Cybern. (2019)
  • W. Wang et al., Semi-supervised video object segmentation with super-trajectories, IEEE Trans. Pattern Anal. Mach. Intell. (2019)

This paper has been recommended for acceptance by Zicheng Liu.