Abstract
Deep Neural Networks have become the reference approach for indexing visual information, yielding state-of-the-art performance on the fundamental tasks of visual information indexing and retrieval such as image classification and object recognition. In fine-grained indexing tasks, namely object recognition in visual scenes, CNN classifiers have to evaluate multiple "object proposals", that is, windows of varying size and location in the image plane; the recognition problem is therefore coupled with a localization problem. In this chapter a model for the prediction of Areas-of-Interest in video on the basis of deep CNNs is proposed. A deep CNN architecture is designed to classify windows as salient or non-salient, and dense saliency maps are then built from the classification scores. Exploiting the known sensitivity of the human visual system (HVS) to residual motion, the usual primary features, such as pixel colour values, are complemented with residual-motion features. The experiments show that the choice of input features for the deep CNN depends on the visual task: when the interest is in dynamic content, the proposed model with residual motion is more efficient.
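The pipeline outlined above (window classification, aggregation of scores into a dense map, colour plus residual-motion input) can be illustrated with a short NumPy fragment. This is a minimal sketch under assumed conventions, not the authors' implementation: the (x, y, w, h) window format, the averaging of overlapping window scores, and the 4-channel stacking of colour and motion planes are all assumptions made here for illustration.

```python
import numpy as np

def dense_saliency_map(frame_shape, windows, salient_scores):
    """Aggregate per-window 'salient' classification scores into a
    dense, pixel-level saliency map normalised to [0, 1].

    frame_shape    : (height, width) of the video frame
    windows        : iterable of (x, y, w, h) object-proposal windows
    salient_scores : CNN score of the 'salient' class for each window
    """
    acc = np.zeros(frame_shape, dtype=np.float32)    # summed scores
    cover = np.zeros(frame_shape, dtype=np.float32)  # window overlap count
    for (x, y, w, h), score in zip(windows, salient_scores):
        acc[y:y + h, x:x + w] += score
        cover[y:y + h, x:x + w] += 1.0
    cover[cover == 0] = 1.0           # uncovered pixels stay at zero
    sal = acc / cover                 # average score where windows overlap
    rng = sal.max() - sal.min()
    return (sal - sal.min()) / rng if rng > 0 else sal

def cnn_input(rgb, residual_motion):
    """Stack a residual-motion plane with the RGB planes, giving a
    4-channel (H, W, 4) input tensor for the saliency classifier."""
    return np.dstack([rgb.astype(np.float32),
                      residual_motion.astype(np.float32)])
```

In the chapter, the per-window scores would come from the deep CNN trained to separate salient from non-salient windows; here they are simply supplied as an array.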
Notes
1. Available at http://www.di.ens.fr/~laptev/actions/hollywood2/.
2. Available at https://crcns.org/data-sets/eye/eye-1.
Acknowledgements
This research has been supported by the University of Bordeaux, the University of Sfax, and the grant UNetBA.
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Chaabouni, S., Benois-Pineau, J., Zemmari, A., Ben Amar, C. (2017). Deep Saliency: Prediction of Interestingness in Video with CNN. In: Benois-Pineau, J., Le Callet, P. (eds.) Visual Content Indexing and Retrieval with Psycho-Visual Models. Multimedia Systems and Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-57687-9_3
DOI: https://doi.org/10.1007/978-3-319-57687-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57686-2
Online ISBN: 978-3-319-57687-9