Abstract
Deep Neural Networks have become the reference approach for indexing visual information, yielding state-of-the-art performance on the fundamental tasks of visual information indexing and retrieval such as image classification and object recognition. In fine-grained indexing tasks, namely object recognition in visual scenes, CNN classifiers have to evaluate multiple "object proposals", that is, windows of varying size and location in the image plane; the recognition problem is therefore coupled with a localization problem. In this chapter a model for the prediction of Areas-of-Interest in video on the basis of deep CNNs is proposed. A deep CNN architecture is designed to classify windows as salient or non-salient, and dense saliency maps are then built from the classification scores. Exploiting the known sensitivity of the human visual system (HVS) to residual motion, the usual primary features, such as pixel colour values, are complemented with residual-motion features. The experiments show that the choice of input features for the deep CNN depends on the visual task: when the interest is in dynamic content, the proposed model with residual motion is more efficient.
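The pipeline outlined above (window classification, aggregation of scores into a dense map, colour plus residual-motion input) can be illustrated with a short NumPy fragment. This is a minimal sketch under assumed conventions, not the authors' implementation: the (x, y, w, h) window format, the averaging of overlapping window scores, and the 4-channel stacking of colour and motion planes are all assumptions made here for illustration.

```python
import numpy as np

def dense_saliency_map(frame_shape, windows, salient_scores):
    """Aggregate per-window 'salient' classification scores into a
    dense, pixel-level saliency map normalised to [0, 1].

    frame_shape    : (height, width) of the video frame
    windows        : iterable of (x, y, w, h) object-proposal windows
    salient_scores : CNN score of the 'salient' class for each window
    """
    acc = np.zeros(frame_shape, dtype=np.float32)    # summed scores
    cover = np.zeros(frame_shape, dtype=np.float32)  # window overlap count
    for (x, y, w, h), score in zip(windows, salient_scores):
        acc[y:y + h, x:x + w] += score
        cover[y:y + h, x:x + w] += 1.0
    cover[cover == 0] = 1.0           # uncovered pixels stay at zero
    sal = acc / cover                 # average score where windows overlap
    rng = sal.max() - sal.min()
    return (sal - sal.min()) / rng if rng > 0 else sal

def cnn_input(rgb, residual_motion):
    """Stack a residual-motion plane with the RGB planes, giving a
    4-channel (H, W, 4) input tensor for the saliency classifier."""
    return np.dstack([rgb.astype(np.float32),
                      residual_motion.astype(np.float32)])
```

In the chapter, the per-window scores would come from the deep CNN trained to separate salient from non-salient windows; here they are simply supplied as an array.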
Notes
1. Available at http://www.di.ens.fr/~laptev/actions/hollywood2/.
2. Available at https://crcns.org/data-sets/eye/eye-1.
Acknowledgements
This research has been supported by the University of Bordeaux, the University of Sfax, and the grant UNetBA.
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Chaabouni, S., Benois-Pineau, J., Zemmari, A., Ben Amar, C. (2017). Deep Saliency: Prediction of Interestingness in Video with CNN. In: Benois-Pineau, J., Le Callet, P. (eds.) Visual Content Indexing and Retrieval with Psycho-Visual Models. Multimedia Systems and Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-57687-9_3
DOI: https://doi.org/10.1007/978-3-319-57687-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57686-2
Online ISBN: 978-3-319-57687-9