Abstract
Visual saliency is the ability of a vision system to promptly select the most relevant data in a scene, reducing the amount of visual data that needs to be processed. Its applications to complex tasks such as object detection, object recognition, and video compression have therefore attracted interest in computer vision. In this paper, we introduce a novel unsupervised method for detecting visual saliency in videos of natural scenes. We divide a video into non-overlapping cuboids and create a matrix whose columns correspond to the intensity values of these cuboids. Simultaneously, we segment the video using a hierarchical segmentation method to obtain super-voxels. A dictionary learned from the feature data matrix of the video is then used to represent the video as coefficients of atoms, and these coefficients are decomposed into salient and non-salient parts. We propose to use group lasso regularization to find the sparse representation of a video, which benefits from the grouping information provided by the super-voxels and from the features extracted from the cuboids. We find salient regions by decomposing the feature matrix of a video into low-rank and sparse matrices using the robust principal component analysis matrix recovery method. We test the applicability of our method on four video datasets of natural scenes. Our experiments provide promising results in predicting eye movements under standard evaluation methods. In addition, we show that our video saliency can be used to improve the performance of human action recognition on a standard dataset.
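The first step described above, splitting a video into non-overlapping cuboids and stacking their intensity values as matrix columns, can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `video_to_cuboid_matrix`, the default cuboid size of 4 × 8 × 8, and the choice to crop trailing frames and pixels that do not fill a whole cuboid are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def video_to_cuboid_matrix(video, size=(4, 8, 8)):
    """Stack non-overlapping cuboids of a grayscale video as matrix columns.

    video: array of shape (T, H, W) holding intensity values.
    size:  hypothetical cuboid size (frames, rows, cols); not from the paper.
    Returns (X, grid): X has one column per cuboid, and grid gives the
    number of cuboids along time, height and width.
    """
    t, h, w = size
    T, H, W = video.shape
    nt, nh, nw = T // t, H // h, W // w

    # Assumption: crop so each dimension is an exact multiple of the cuboid size.
    v = video[:nt * t, :nh * h, :nw * w]

    # Split every axis into (cuboid index, offset inside the cuboid).
    v = v.reshape(nt, t, nh, h, nw, w)

    # Move the three cuboid indices to the front, the offsets to the back.
    v = v.transpose(0, 2, 4, 1, 3, 5)

    # One row per cuboid, then transpose so that each cuboid is a column.
    X = v.reshape(nt * nh * nw, t * h * w).T
    return X, (nt, nh, nw)


if __name__ == "__main__":
    # Synthetic 8-frame, 16 x 16 video used purely as a stand-in for real data.
    video = np.arange(8 * 16 * 16, dtype=float).reshape(8, 16, 16)
    X, grid = video_to_cuboid_matrix(video)
    print(X.shape, grid)
```

A matrix of this form is the kind of input that the dictionary learning and low-rank plus sparse decomposition steps mentioned in the abstract would operate on. Column `k` corresponds to the cuboid at grid position `(k // (nh * nw), (k // nw) % nh, k % nw)`, which makes it possible to map per-column saliency scores back to video locations.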
Acknowledgments
This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20066. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.
Additional information
Communicated by Jakob Verbeek.
About this article
Cite this article
Souly, N., Shah, M. Visual Saliency Detection Using Group Lasso Regularization in Videos of Natural Scenes. Int J Comput Vis 117, 93–110 (2016). https://doi.org/10.1007/s11263-015-0853-6
DOI: https://doi.org/10.1007/s11263-015-0853-6