Abstract
This chapter addresses the open problem of meaningful object recognition in video. Approaches that estimate human visual attention and incorporate it into the overall visual content understanding process have recently become popular. Estimating visual attention in complex spatio-temporal content such as video requires fusing multiple information channels, including motion and spatial contrast. In the first part of the chapter, we study these questions and report optimal strategies for bottom-up fusion in visual saliency estimation. The estimated visual saliency is then used to pool local descriptors. We compare different pooling approaches and show results on challenging visual content: video recorded with wearable cameras for a large-scale study on Alzheimer's disease. The results, presented together with our conclusions, demonstrate that approaches based on saliency fusion outperform the best state-of-the-art techniques on this content.
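The pooling idea summarized in the abstract, weighting each local descriptor's vote by the saliency at its image location before aggregating into a bag-of-words histogram, can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the hard assignment to visual words, the function name, and all parameters are assumptions.

```python
import numpy as np

def saliency_weighted_bow(descriptors, positions, saliency_map, codebook):
    """Pool local descriptors into a bag-of-words histogram,
    weighting each descriptor's vote by the saliency at its location."""
    hist = np.zeros(codebook.shape[0])
    for desc, (x, y) in zip(descriptors, positions):
        # Hard-assign the descriptor to its nearest visual word.
        word = np.argmin(np.linalg.norm(codebook - desc, axis=1))
        # Vote with the local saliency value instead of a count of 1.
        hist[word] += saliency_map[y, x]
    total = hist.sum()
    return hist / total if total > 0 else hist

# Toy example: 2 visual words, 3 descriptors on a 4x4 saliency map.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9]])
positions = [(0, 0), (2, 2), (3, 1)]          # (x, y) keypoint locations
saliency = np.linspace(0.0, 1.0, 16).reshape(4, 4)
print(saliency_weighted_bow(descriptors, positions, saliency, codebook))
```

With uniform saliency this reduces to the classical bag-of-words count; a non-uniform map shifts histogram mass toward descriptors in attended regions, which is the effect the chapter exploits for object recognition.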
References
Pirsiavash H, Ramanan D (2012) Detecting activities of daily living in first-person camera views. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 2847–2854
Felzenszwalb PF, Girshick RB, McAllester DA, Ramanan D (2010) Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell 32(9):1627–1645
Lampert CH, Blaschko MB, Hofmann T (2008) Beyond sliding windows: object localization by efficient subwindow search. In: IEEE computer society conference on computer vision and pattern recognition (CVPR 2008), IEEE Computer Society, Anchorage, 24–26 June 2008
Itti L, Koch C (2001) Computational modelling of visual attention. Nat Rev Neurosci 2(3):194–203
Fathi A, Li Y, Rehg JM (2012) Learning to recognize daily actions using gaze. In: Proceedings of the 12th European conference on computer vision—Volume Part I, ECCV'12. Springer, Berlin, pp 314–327
Ogaki K, Kitani KM, Sugano Y, Sato Y (2012) Coupling eye-motion and ego-motion features for first-person activity recognition. In: 2012 IEEE computer society conference on computer vision and pattern recognition workshops. IEEE, pp 1–7
Csurka G, Dance CR, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision, ECCV, pp 1–22
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE conference on computer vision and pattern recognition (CVPR), vol 2, pp 886–893
Jing F, Li M, Zhang H, Zhang B (2002) An effective region-based image retrieval framework. In: ACM international conference on multimedia
Long F, Zhang H, Feng D (2003) Fundamentals of content-based image retrieval. In: Multimedia information retrieval and management
Manjunath B, Ohm J, Vasudevan V, Yamada A (2001) Color and texture descriptors. IEEE Trans Circuits Syst Video Technol 11(6):703–715
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Intern J Comput Vis 60:91–110
Bay H, Ess A, Tuytelaars T, Van Gool L (2008) Speeded-up robust features (SURF). Comput Vis Image Underst 110:346–359
Mokhtarian F, Suomela R (1998) Robust image corner detection through curvature scale space. IEEE Trans Pattern Anal Mach Intell 20(12):1376–1381
Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: Proceedings of the international conference on computer vision, vol 2, pp 1470–1477
Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: 2001 IEEE computer society conference on computer vision and pattern recognition, vol 1. IEEE, Los Alamitos, pp 511–518
de Carvalho Soares R, da Silva I, Guliato D (2012) Spatial locality weighting of features using saliency map with a bag-of-visual-words approach. In: IEEE 24th international conference on tools with artificial intelligence (ICTAI), vol 1. pp 1070–1075
Sharma G, Jurie F, Schmid C (2012) Discriminative spatial saliency for image classification. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 3506–3513
Vig E, Dorr M, Cox D (2012) Space-variant descriptor sampling for action recognition based on saliency and eye movements. In: European conference on computer vision (ECCV 2012). Springer, Berlin, pp 84–97
Treisman AM, Gelade G (1980) A feature-integration theory of attention. Cogn Psychol 12(1):97–136
Borji A, Itti L (2012) State-of-the-art in visual attention modeling. IEEE Trans Pattern Anal Mach Intell 34(9):1758–1772
Tatler BW (2007) The central fixation bias in scene viewing: selecting an optimal viewing position independently of motor biases and image feature distributions. J Vis 7(14):1–17
Dorr M, Martinetz T, Gegenfurtner KR, Barth E (2010) Variability of eye movements when viewing dynamic natural scenes. J Vis, 10(10):28
Koch C, Ullman S (1985) Shifts in selective visual attention: towards the underlying neural circuitry. Hum Neurobiol 4:219–227
Posner MI, Cohen YA (1984) Components of visual orienting. In: Bouma H, Bouwhuis DG (eds) Attention and performance X: control of language processes. Lawrence Erlbaum, Hillsdale
Parkhurst D, Law K, Niebur E (2002) Modeling the role of salience in the allocation of overt visual attention. Vis Res 42(1):107–123
Harel J, Koch C, Perona P (2007) Graph-based visual saliency. In: Advances in neural information processing systems 19. MIT Press, Cambridge, pp 545–552
Marat S, Ho Phuoc T, Granjon L, Guyader N, Pellerin D, Guérin-Dugué A (2009) Modelling spatio-temporal saliency to predict gaze direction for short videos. Intern J Comput Vis 82(3):231–243
Itti L, Koch C, Niebur E (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Trans Pattern Anal Mach Intell 20(11):1254–1259
Itti L, Baldi PF (2006) Bayesian surprise attracts human attention. In: Advances in neural information processing systems 18 (NIPS 2005). MIT Press, Cambridge, pp 547–554
Bruce NDB, Tsotsos JK (2006) Saliency based on information maximization. In: Weiss Y, Schölkopf B, Platt J (eds) Advances in neural information processing systems 18. MIT Press, Cambridge, pp 155–162
Itti L, Braun J, Lee DK, Koch C (1999) Attentional modulation of human pattern discrimination psychophysics reproduced by a quantitative model. In: Advances in neural information processing systems. MIT Press, Cambridge
Itti L (2000) A saliency-based search mechanism for overt and covert shifts of visual attention. Vis Res 40(10–12):1489–1506
Lee DK, Itti L, Koch C, Braun J (1999) Attention activates winner-take-all competition among visual filters. Nat Neurosci 2(4):375–381
Brouard O, Ricordel V, Barba D (2009) Cartes de saillance spatio-temporelle basées contrastes de couleur et mouvement relatif. In: Compression et représentation des signaux audiovisuels (CORESA)
Farnebäck G (2000) Fast and accurate motion estimation using orientation tensors and parametric motion models. In: Proceedings of 15th international conference on pattern recognition, vol 1. IAPR, Barcelona, pp 135–139
Fischler MA, Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM 24:381–395
Daly SJ (1998) Engineering observations from spatiovelocity and spatiotemporal visual models. In: IS&T/SPIE conference on human vision and electronic imaging III
Boujut H, Benois-Pineau J, Megret R (2012) Fusion of multiple visual cues for visual saliency extraction from wearable camera settings with strong motion. In: Fusiello A, Murino V, Cucchiara R (eds) Computer vision—ECCV 2012. Workshops and Demonstrations, Lecture Notes in Computer Science, vol 7585. Springer, Berlin, pp 436–445
Land M, Mennie N, Rusted J (1999) The roles of vision and eye movements in the control of activities of daily living. Perception 28:1311–1328
Moré JJ, Sorensen DC (1983) Computing a trust region step. SIAM J Sci Stat Comput 4(3):553–572
Boujut H, Benois-Pineau J, Ahmed T, Hadar O, Bonnet P (2011) A metric for no-reference video quality assessment for HD TV delivery based on saliency maps. In: IEEE international conference on multimedia and expo, July 2011
Tuytelaars T, Lampert C, Blaschko M, Buntine W (2010) Unsupervised object discovery: a comparison. Intern J Comput Vis 88:284–302
Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: improving particular object retrieval in large scale image databases. In: IEEE conference on computer vision and pattern recognition, pp 1–8, June 2008
Marszałek M, Schmid C (2006) Spatial weighting for bag-of-features. In: IEEE conference on computer vision and pattern recognition, vol 2. pp 2118–2125
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297
Sreekanth V, Vedaldi A, Jawahar CV, Zisserman A (2010) Generalized RBF feature maps for efficient detection. In: Proceedings of the British machine vision conference (BMVC)
Fathi A, Ren X, Rehg JM (2011) Learning to recognize objects in egocentric activities. In: The 24th IEEE conference on computer vision and pattern recognition, CVPR 2011, IEEE, Colorado Springs, 20–25 June 2011, pp 3281–3288
Over P, Awad G, Michel M, Fiscus J, Sanders G, Shaw B, Kraaij W, Smeaton AF, Quénot G (2012) TRECVID 2012—an overview of the goals, tasks, data, evaluation mechanisms and metrics. In: Proceedings of TRECVID 2012. NIST, USA
Acknowledgments
This research has been supported by the Region of Aquitaine and the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 288199 (Dem@care project).
Copyright information
© 2014 Springer International Publishing Switzerland
Cite this chapter
González-Díaz, I., Benois-Pineau, J., Buso, V., Boujut, H. (2014). Fusion of Multiple Visual Cues for Object Recognition in Videos. In: Ionescu, B., Benois-Pineau, J., Piatrik, T., Quénot, G. (eds) Fusion in Computer Vision. Advances in Computer Vision and Pattern Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-05696-8_4
DOI: https://doi.org/10.1007/978-3-319-05696-8_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05695-1
Online ISBN: 978-3-319-05696-8
eBook Packages: Computer Science (R0)