
Information Sciences

Volume 576, October 2021, Pages 819-830

Integrating object proposal with attention networks for video saliency detection

https://doi.org/10.1016/j.ins.2021.08.069

Abstract

Video saliency detection is an active research topic in both information science and visual psychology. In this paper, we propose an efficient video saliency-detection model that integrates object proposals with attention networks to capture salient objects and human-attention areas in dynamic video scenes. In our algorithm, visual object features are first extracted from individual video frames using a real-time neural network for object detection. Then, the spatial position information of each frame is used to screen out large background regions, reducing the influence of background noise. Finally, the background-removed results are further refined by propagating the visual cues, through an adaptive weighting scheme, into the later layers of a convolutional neural network. Experimental results on widely used video saliency-detection databases verify that the proposed framework outperforms existing deep models.

Introduction

In recent years, visual saliency detection (VSD) has triggered broad academic research in machine learning and computer vision [1], [2], [3], and is an important technique for many real-world applications [11], [20], [21], [22], [23], [24], [25], [26]. The aim of video saliency detection is to perceive and discover conspicuous objects/targets in a video sequence by simulating the human visual-attention mechanism. Traditional static-image saliency-detection methods have achieved impressive performance on various practical tasks [7], [12], [13], [14], [30], [31], [48]. However, in contrast to image saliency detection, exploiting the consistency of spatiotemporal features for video saliency detection remains an intractable task. The main cause is the complicated dynamic relations between the frames of a video sequence. In a static image, the conspicuous objects are quiescent and motionless [21], [22], [23], [24], [25], [26], [27], [28], [29], [32], [33], [34], [35], [36], [37], [38], [39], [40]; in a video sequence, the attractive objects in successive frames are steadily altering and gradually evolving over time. Therefore, the key to saliency modelling for intra-frames and inter-frames is to constantly discover relevant, remarkable and moving objects via the simultaneous consideration of spatial and temporal cues, which is still an open problem and remains a challenge for the research community [42], [43], [44], [45], [46], [47].

In this paper, we propose a model that integrates object proposals with attention networks, via visual selectivity, in computing saliency. The main novelties and contributions of the proposed method are as follows:

  • The YOLO model is used to roughly select salient object proposals. The object spatial-position prior not only improves the detection accuracy, but also removes irrelevant background noise.

  • With the aid of spatial cues from object proposals, the alpha channel feature is added to alleviate the unfavourable effect of the complex background in the video frames.

  • To further highlight salient objects with temporal consistency, a weight sharing strategy is proposed, which uses an attention mechanism to capture the spatiotemporal features in a video sequence, so as to refine the quality of the final saliency maps.
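
The three steps above can be sketched as a toy pipeline. This is a hypothetical illustration, not the authors' implementation: the function names, the confidence threshold, and the area-ratio heuristic for discarding background-sized boxes are all assumptions for the sketch.

```python
import numpy as np

def screen_proposals(boxes, scores, frame_shape, score_thr=0.5, max_area_ratio=0.8):
    """Keep detector outputs as salient-object proposals, discarding weak
    detections and near-frame-sized boxes that likely cover background.
    (Illustrative heuristic; thresholds are assumptions, not from the paper.)"""
    h, w = frame_shape[:2]
    keep = []
    for (x1, y1, x2, y2), s in zip(boxes, scores):
        area_ratio = ((x2 - x1) * (y2 - y1)) / float(h * w)
        if s >= score_thr and area_ratio <= max_area_ratio:
            keep.append((x1, y1, x2, y2))
    return keep

def alpha_mask(frame_shape, proposals):
    """Build an alpha-channel-style mask that keeps proposal regions (1.0)
    and suppresses the remaining background pixels (0.0)."""
    alpha = np.zeros(frame_shape[:2], dtype=np.float32)
    for x1, y1, x2, y2 in proposals:
        alpha[y1:y2, x1:x2] = 1.0
    return alpha
```

In this sketch a near-frame-sized detection is treated as background and removed, while compact, confident detections survive and define the foreground mask that later network stages would refine.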

The remainder of this paper is organized as follows. Section 2 gives an overview of different salient-object detection models. Section 3 describes the proposed video saliency-detection framework in detail. Section 4 presents the experimental results of our method and state-of-the-art salient-object detection methods on benchmark data sets. Section 5 concludes the paper.

Section snippets

Related work

Recently, video saliency detection has inspired wide interest among researchers in different disciplines. Seo and Milanfar [1] proposed an effective approach for spatiotemporal video saliency detection. In their algorithm, a bottom-up model was devised based on low-level contrastive cues of an input frame, estimating the degree of saliency of each pixel in view of its surrounding neighbourhood. Later, Xi et al. [2] proposed to apply the background visual cue in static images to video

The proposed method

The proposed framework is illustrated in Fig. 2; it is composed of three parallel attention networks, as shown on the right. The details of the input and of the three networks are given in Section 3.1 (Preprocessing of video frames) and Section 3.2 (Deep networks for saliency detection), respectively. Specifically, the preprocessing of a video sequence and the generation of object proposals are presented in Section 3.1. Then, the attention networks for inter-frame salient-object detection are described in Section
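
The weight-sharing idea behind the parallel networks can be sketched minimally: one attention module whose parameters are reused for every frame of a clip, so consecutive frames are scored by the same function. This is a generic scaled dot-product attention sketch under assumed shapes, not the paper's architecture; the class name and dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedAttention:
    """A single attention module whose weights are shared across all frames
    of a clip, encouraging temporally consistent saliency responses."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.wq = rng.standard_normal((dim, dim)) * scale
        self.wk = rng.standard_normal((dim, dim)) * scale
        self.wv = rng.standard_normal((dim, dim)) * scale

    def __call__(self, feats):
        # feats: (n_positions, dim) feature map of one frame, flattened.
        q, k, v = feats @ self.wq, feats @ self.wk, feats @ self.wv
        attn = softmax(q @ k.T / np.sqrt(feats.shape[1]), axis=-1)
        return attn @ v

# The same module instance (same weights) processes every frame.
attn = SharedAttention(dim=16)
clip = [np.random.default_rng(i).standard_normal((64, 16)) for i in range(3)]
refined = [attn(f) for f in clip]
```

Because one parameter set scores all frames, similar regions in neighbouring frames receive similar attention, which is the intuition behind using weight sharing for temporal consistency.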

Experiments

To evaluate the performance of the proposed video saliency-detection framework, we will describe, in detail, the data sets used in our experiments, the evaluation metrics, the state-of-the-art saliency-detection methods to be compared, and the evaluation protocol.
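
Saliency-detection benchmarks conventionally report mean absolute error (MAE) and the F-measure with beta-squared = 0.3; a minimal sketch of these two standard metrics (the fixed binarization threshold here is a simplifying assumption, as benchmarks often sweep thresholds):

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and binary ground truth,
    both assumed to be in [0, 1]."""
    return float(np.mean(np.abs(sal - gt)))

def f_measure(sal, gt, beta2=0.3, thr=0.5):
    """F-measure with beta^2 = 0.3, the weighting conventional in
    salient-object-detection evaluation; `thr` binarizes the map."""
    pred = sal >= thr
    tp = np.logical_and(pred, gt > 0).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max((gt > 0).sum(), 1)
    denom = beta2 * precision + recall
    return float((1 + beta2) * precision * recall / denom) if denom > 0 else 0.0
```

A perfect prediction yields MAE 0 and F-measure 1; degradations in either localization or coverage lower the F-measure because of its precision/recall trade-off.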

Conclusion

In this paper, we have exploited deep attention networks for video saliency detection. In the proposed model, the spatial locations of potential object proposals are used to effectively filter out background noise. Furthermore, based on a weight-sharing mechanism, the consistency of the saliency maps between consecutive frames is improved by capturing the spatial and temporal features of dynamic video scenes. Extensive experiments have been performed

CRediT authorship contribution statement

Muwei Jian: Conceptualization, Methodology, Software, Writing – original draft. Jiaojin Wang: Conceptualization, Software, Visualization, Investigation. Hui Yu: Supervision, Validation, Visualization. Gai-Ge Wang: Supervision, Validation, Data curation, Writing – original draft.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

We would like to thank Prof. K. M. Lam in the Department of Electronic and Information Engineering, Hong Kong Polytechnic University, for providing technical editing and proofreading of the manuscript.

This work was supported by National Natural Science Foundation of China (NSFC) (61976123, 61601427); Taishan Young Scholars Program of Shandong Province; Royal Society - K. C. Wong International Fellowship (NIF\R1\180909); and Key Development Program for Basic Research of Shandong Province.

References (48)

  • C. Chen et al., Video saliency detection via spatial-temporal fusion and low-rank coherency diffusion, IEEE Trans. Image Process. (2017).
  • J. Redmon et al., YOLOv3: An Incremental Improvement (2018).
  • W. Wang et al., Video salient object detection via fully convolutional networks, IEEE Trans. Image Process. (2018).
  • Y. Fang et al., A video saliency detection model in compressed domain, IEEE Trans. Circuits Syst. Video Technol. (2014).
  • W. Wang et al., Saliency-aware geodesic video object segmentation, IEEE CVPR (2015).
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998).
  • F. Perazzi et al., Saliency filters: contrast based filtering for salient region detection.
  • T.-N. Le et al., Deeply supervised 3D recurrent FCN for salient object detection in videos, BMVC (2017).
  • X. Zhang et al., Progressive attention guided recurrent network for salient object detection.
  • H. Kim et al., Spatiotemporal saliency detection for video sequences based on random walk with restart, IEEE Trans. Image Process. (2015).
  • S. Li et al., Unsupervised video object segmentation with motion-based bilateral networks, ECCV (2018).
  • R. Achanta et al., Frequency-tuned salient region detection.
  • E. Rahtu et al., Segmenting salient objects from images and videos.
  • Z.X. Ren et al., Region-based saliency detection and its application in object recognition, IEEE Trans. Circuits Syst. Video Technol. (2014).

    Muwei Jian received the PhD degree from the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, in October 2014. He was a Lecturer with the Department of Computer Science and Technology, Ocean University of China, from 2015 to 2017. Currently, Dr. Jian is a Professor and Ph.D Supervisor at the School of Computer Science and Technology, Shandong University of Finance and Economics.

    His current research interests include human face recognition, image and video processing, machine learning and computer vision. Prof. Jian was actively involved in professional activities. He has been a member of the Program Committee and Special Session Chair of several international conferences, such as SNPD 2007, ICIS 2008, APSIPA 2015, EEECS 2016, ICTAI2016, ICGIP 2016, ICTAI 2017 and ICTAI 2018. Dr. Jian has also served as a reviewer for several international SCI-indexed journals, including IEEE Trans., Pattern Recognition, Information Sciences, Computers in Industry, Machine Vision and Applications, Machine Learning and Cybernetics, The Imaging Science Journal, and Multimedia Tools and Applications. Prof. Jian holds 3 granted national patents and has published over 40 papers in refereed international leading journals/conferences such as IEEE Trans. on Cybernetics, IEEE Trans. on Circuits and Systems for Video Technology, Pattern Recognition, Information Sciences, Signal Processing, ISCAS, ICME and ICIP.

    Jiaojin Wang is pursuing his Master's degree supervised by Prof. Muwei Jian, at the School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan, China. His research interests include image processing, pattern recognition, and computer vision.

    Hui Yu is Professor with the University of Portsmouth, UK. His research interests include vision, computer graphics and application of machine learning and AI to above areas, particularly in human machine interaction, image processing and recognition, Virtual and Augmented reality, 3D reconstruction, robotics and geometric processing of facial performances. He serves as an Associate Editor of IEEE Transactions on Human-Machine Systems and the Neurocomputing journal.
