Integrating object proposal with attention networks for video saliency detection
Introduction
In recent years, visual saliency detection (VSD) has triggered broad academic research in machine learning and computer vision [1], [2], [3], and is an important technique for many real-world applications [11], [20], [21], [22], [23], [24], [25], [26]. Video saliency detection aims to perceive and discover conspicuous objects/targets in a video sequence by simulating the human visual attention mechanism. Traditional static-image saliency-detection methods have achieved impressive performance on various practical tasks [7], [12], [13], [14], [30], [31], [48]. However, in contrast to image saliency detection, exploiting the consistency of spatiotemporal features for video saliency detection remains an intractable task, mainly because of the complicated dynamic relations between the frames of a video sequence. In a static image, the conspicuous objects are quiescent and motionless [21], [22], [23], [24], [25], [26], [27], [28], [29], [32], [33], [34], [35], [36], [37], [38], [39], [40]; in a video sequence, by contrast, the attractive objects in successive frames alter steadily and evolve gradually over time. The key to saliency modelling within and across frames is therefore to constantly discover the relevant, remarkable and moving objects by considering spatial and temporal cues simultaneously, which is still an open problem and remains a challenge for the research community [42], [43], [44], [45], [46], [47].
In this paper, we propose a model, which integrates object proposal with attention networks via visual selectivity in computing saliency. The main novelties and contributions of our proposed method are as follows:
- The YOLO model is used to coarsely select salient object proposals. This spatial object-position prior not only improves detection accuracy, but also removes irrelevant background noise.
- With the aid of spatial cues from the object proposals, an alpha-channel feature is added to alleviate the adverse effect of complex backgrounds in video frames.
- To further highlight salient objects with temporal consistency, a weight-sharing strategy is proposed, which uses an attention mechanism to capture the spatiotemporal features of a video sequence and thereby refine the quality of the final saliency maps.
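To give a concrete flavour of the weight-sharing idea in the last bullet, the sketch below applies a single attention head, with one shared set of projection weights, to the tokens of several consecutive frames at once, so that each position can attend across both space and time. This is a minimal NumPy illustration under our own assumptions (token shapes, a single head, random weights), not the authors' actual network.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedAttention:
    """One attention head whose projection weights are shared across
    all frames of a clip (a hypothetical simplification of the paper's
    weight-sharing strategy)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, frames):
        # frames: (T, N, dim) -- T frames, N spatial tokens per frame.
        # Flattening the tokens of all frames into one sequence lets
        # every position attend across time as well as space, while
        # the same wq/wk/wv are reused for every frame.
        t, n, d = frames.shape
        x = frames.reshape(t * n, d)
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        att = softmax(q @ k.T / np.sqrt(d))
        return (att @ v).reshape(t, n, d)

frames = np.random.default_rng(1).standard_normal((3, 16, 8))  # 3 frames
out = SharedAttention(8)(frames)
print(out.shape)  # (3, 16, 8)
```

Because the projections are shared, corresponding objects in neighbouring frames are embedded by the same transformation, which is what encourages temporally consistent saliency responses.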
The remainder of this paper is arranged as follows. Section 2 is an overview of different salient-object detection models. Section 3 introduces the proposed video saliency detection framework in detail. Section 4 presents the experimental results of our proposed method and state-of-the-art salient-object detection methods on benchmark data sets. A conclusion is drawn in Section 5.
Section snippets
Related work
Recently, video saliency detection has attracted wide interest from researchers in different disciplines. Seo and Milanfar [1] proposed an effective approach for spatiotemporal video saliency detection, in which a bottom-up model was devised based on low-level contrastive cues of an input frame, estimating the degree of saliency of each pixel in view of its surrounding neighbourhood. Later, Xi et al. [2] proposed to apply the background visual cue in static images to video
The proposed method
The proposed framework is illustrated in Fig. 2 and is composed of three parallel attention networks, as shown on the right. The details of the input and the three networks are given in Section 3.1 (preprocessing of video frames) and Section 3.2 (deep networks for saliency detection), respectively. Specifically, the preprocessing of a video sequence and the generation of object proposals are presented in Section 3.1. Then, the attention networks for inter-frame salient-object detection are described in Section
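The preprocessing step pairs detector proposals with an alpha channel, as described in the contributions above. The sketch below shows one plausible way to realise this: boxes from a pretrained detector (the `(x1, y1, x2, y2)` format and the helper name are assumptions for illustration, not the paper's exact interface) are rasterised into a binary alpha mask and appended as a fourth channel, so downstream layers can suppress background clutter.

```python
import numpy as np

def add_alpha_from_proposals(frame, boxes):
    """Append an alpha channel built from object-proposal boxes.

    frame : (H, W, 3) RGB array.
    boxes : iterable of (x1, y1, x2, y2) proposals, e.g. from a
            pretrained YOLO detector (the box format is an assumption).
    Pixels inside any proposal get alpha 1, background gets 0.
    """
    h, w, _ = frame.shape
    alpha = np.zeros((h, w, 1), dtype=frame.dtype)
    for x1, y1, x2, y2 in boxes:
        alpha[y1:y2, x1:x2] = 1  # mark proposal region as foreground
    return np.concatenate([frame, alpha], axis=2)  # (H, W, 4) RGBA

rgb = np.zeros((64, 64, 3), dtype=np.float32)
rgba = add_alpha_from_proposals(rgb, [(10, 10, 30, 30)])
print(rgba.shape, rgba[..., 3].sum())  # (64, 64, 4) 400.0
```

Overlapping proposals simply overwrite the same mask region, so the alpha channel stays binary regardless of how many boxes the detector returns.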
Experiments
To evaluate the performance of the proposed video saliency-detection framework, we will describe, in detail, the data sets used in our experiments, the evaluation metrics, the state-of-the-art saliency-detection methods to be compared, and the evaluation protocol.
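Two metrics commonly used to score saliency maps against ground-truth masks are the mean absolute error (MAE) and the F-measure; the sketch below implements both in their conventional form (the fixed binarisation threshold and beta^2 = 0.3 are standard choices, though adaptive thresholding is also common — the exact protocol of this paper is not restated here).

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and ground truth,
    both with values in [0, 1]."""
    return float(np.abs(sal - gt).mean())

def f_measure(sal, gt, beta2=0.3, thresh=0.5):
    """Weighted F-measure with the conventional beta^2 = 0.3,
    after binarising the prediction at a fixed threshold."""
    pred = sal >= thresh
    mask = gt >= 0.5
    tp = np.logical_and(pred, mask).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(mask.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return float((1 + beta2) * precision * recall
                 / (beta2 * precision + recall))

sal = np.array([[0.9, 0.1], [0.8, 0.2]])
gt = np.array([[1.0, 0.0], [1.0, 0.0]])
print(mae(sal, gt), f_measure(sal, gt))  # 0.15 1.0
```

Lower MAE and higher F-measure are better; reporting both guards against a method that scores well on one by trading off the other.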
Conclusion
In this paper, we have exploited deep attention networks for video saliency detection. In the proposed model, the information about the spatial location of potential object proposals can be used to effectively filter out background noises. Furthermore, based on a weight-sharing mechanism, the consistency of the saliency maps between consecutive frames can be improved effectively by capturing the spatial and temporal features in the dynamic video scenes. Extensive experiments have been performed
CRediT authorship contribution statement
Muwei Jian: Conceptualization, Methodology, Software, Writing – original draft. Jiaojin Wang: Conceptualization, Software, Visualization, Investigation. Hui Yu: Supervision, Validation, Visualization. Gai-Ge Wang: Supervision, Validation, Data curation, Writing – original draft.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
We would like to thank Prof. K. M. Lam in the Department of Electronic and Information Engineering, Hong Kong Polytechnic University, for providing technical editing and proofreading of the manuscript.
This work was supported by the National Natural Science Foundation of China (NSFC) (61976123, 61601427); the Taishan Young Scholars Program of Shandong Province; the Royal Society - K. C. Wong International Fellowship (NIF\R1\180909); and the Key Development Program for Basic Research of Shandong Province.
References (48)
- et al., "MCCH: a novel convex hull-based prior method for saliency detection," Inf. Sci., 2019.
- et al., "Facial-feature detection and localization based on a hierarchical scheme," Inf. Sci., 2014.
- et al., "Integrating QDWD with pattern distinctness and local contrast for underwater saliency detection," J. Vis. Commun. Image Represent., 2018.
- et al., "Saliency detection based on directional patches extraction and principal local color contrast," J. Vis. Commun. Image Represent., 2018.
- et al., "SSPNet: learning spatiotemporal saliency prediction networks for visual tracking," Inf. Sci., 2021.
- et al., "CNN-based encoder-decoder networks for salient object detection: a comprehensive review and recent advances," Inf. Sci., 2021.
- et al., "A classifier-guided approach for top-down salient object detection," Signal Process. Image Commun., 2016.
- et al., "Static and space-time visual saliency detection by self-resemblance," J. Vision, 2009.
- et al., "Salient object detection with spatiotemporal background priors for video," IEEE Trans. Image Process., 2017.
- et al., "Superpixel-based spatiotemporal saliency detection," IEEE Trans. Circuits Syst. Video Technol., 2014.
- "Video saliency detection via spatial-temporal fusion and low-rank coherency diffusion," IEEE Trans. Image Process.
- "YOLOv3: an incremental improvement."
- "Video salient object detection via fully convolutional networks," IEEE Trans. Image Process.
- "A video saliency detection model in compressed domain," IEEE Trans. Circuits Syst. Video Technol.
- "Saliency-aware geodesic video object segmentation," IEEE CVPR.
- "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Mach. Intell.
- "Saliency filters: contrast based filtering for salient region detection."
- "Deeply supervised 3D recurrent FCN for salient object detection in videos," BMVC.
- "Progressive attention guided recurrent network for salient object detection."
- "Spatiotemporal saliency detection for video sequences based on random walk with restart," IEEE Trans. Image Process.
- "Unsupervised video object segmentation with motion-based bilateral networks," ECCV.
- "Frequency-tuned salient region detection."
- "Segmenting salient objects from images and videos."
- "Region-based saliency detection and its application in object recognition," IEEE Trans. Circuits Syst. Video Technol.
Muwei Jian received the PhD degree from the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, in October 2014. He was a Lecturer with the Department of Computer Science and Technology, Ocean University of China, from 2015 to 2017. Currently, Dr. Jian is a Professor and Ph.D Supervisor at the School of Computer Science and Technology, Shandong University of Finance and Economics.
His current research interests include human face recognition, image and video processing, machine learning and computer vision. Prof. Jian was actively involved in professional activities. He has been a member of the Program Committee and Special Session Chair of several international conferences, such as SNPD 2007, ICIS 2008, APSIPA 2015, EEECS 2016, ICTAI2016, ICGIP 2016, ICTAI 2017 and ICTAI 2018. Dr. Jian has also served as a reviewer for several international SCI-indexed journals, including IEEE Trans., Pattern Recognition, Information Sciences, Computers in Industry, Machine Vision and Applications, Machine Learning and Cybernetics, The Imaging Science Journal, and Multimedia Tools and Applications. Prof. Jian holds 3 granted national patents and has published over 40 papers in refereed international leading journals/conferences such as IEEE Trans. on Cybernetics, IEEE Trans. on Circuits and Systems for Video Technology, Pattern Recognition, Information Sciences, Signal Processing, ISCAS, ICME and ICIP.
Jiaojin Wang is pursuing his Master's degree supervised by Prof. Muwei Jian, at the School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan, China. His research interests include image processing, pattern recognition, and computer vision.
Hui Yu is a Professor with the University of Portsmouth, UK. His research interests include vision, computer graphics, and the application of machine learning and AI to these areas, particularly human-machine interaction, image processing and recognition, virtual and augmented reality, 3D reconstruction, robotics, and the geometric processing of facial performances. He serves as an Associate Editor of IEEE Transactions on Human-Machine Systems and the Neurocomputing journal.