Fusion of Multiple Visual Cues for Object Recognition in Videos

  • Chapter
  • In: Fusion in Computer Vision

Abstract

In this chapter we address the open problem of meaningful object recognition in video. Approaches that estimate human visual attention and incorporate it into the overall visual content understanding process have recently become popular. To estimate visual attention in complex spatio-temporal content such as video, one has to fuse multiple information channels, such as motion and spatial contrast. In the first part of the chapter we examine these questions and report on optimal strategies for bottom-up fusion in visual saliency estimation. The estimated visual saliency is then used for pooling of local descriptors. We compare different pooling approaches and show results on a challenging kind of visual content: video recorded with wearable cameras for a large-scale study of Alzheimer's disease. The results demonstrate that approaches based on saliency fusion outperform the best state-of-the-art techniques on this content.
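
The fusion and pooling strategies themselves are developed in the body of the chapter. As a minimal sketch of the two ideas the abstract names, the Python fragment below fuses a spatial and a temporal saliency map with a simple weighted average and then uses the fused map to weight hard-assigned bag-of-visual-words votes. The function names, the mixing weight alpha, and the averaging rule are illustrative assumptions, not the authors' actual model.

```python
import numpy as np

def fuse_saliency(spatial, temporal, alpha=0.5):
    """Illustrative fusion of two saliency channels (assumption: a
    weighted average; the chapter studies fusion strategies in depth).
    Both inputs are H x W arrays with values in [0, 1]."""
    fused = alpha * spatial + (1.0 - alpha) * temporal
    return fused / (fused.max() + 1e-8)  # renormalize to [0, 1]

def saliency_weighted_bow(descriptors, keypoints, saliency, codebook):
    """Pool local descriptors into a bag-of-visual-words histogram,
    weighting each visual-word vote by the saliency at its keypoint.

    descriptors: (N, D) local descriptors (e.g. SIFT)
    keypoints:   (N, 2) integer (row, col) keypoint positions
    saliency:    (H, W) fused saliency map in [0, 1]
    codebook:    (K, D) visual-word centroids
    """
    hist = np.zeros(codebook.shape[0])
    for desc, (r, c) in zip(descriptors, keypoints):
        # hard-assign the descriptor to the nearest visual word
        word = np.argmin(np.linalg.norm(codebook - desc, axis=1))
        # a salient keypoint contributes more than a background one
        hist[word] += saliency[r, c]
    return hist / (hist.sum() + 1e-8)  # L1-normalized image signature
```

In this sketch a descriptor detected on a salient region, such as a manipulated object, contributes more to the image signature than one on the background, which is the intuition behind saliency-based pooling.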


Acknowledgments

This research has been supported by the Aquitaine region and by the European Community's Seventh Framework Programme (FP7/2007–2013) under Grant Agreement 288199 (Dem@care project).

Author information

Correspondence to Jenny Benois-Pineau.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

González-Díaz, I., Benois-Pineau, J., Buso, V., Boujut, H. (2014). Fusion of Multiple Visual Cues for Object Recognition in Videos. In: Ionescu, B., Benois-Pineau, J., Piatrik, T., Quénot, G. (eds) Fusion in Computer Vision. Advances in Computer Vision and Pattern Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-05696-8_4

  • DOI: https://doi.org/10.1007/978-3-319-05696-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-05695-1

  • Online ISBN: 978-3-319-05696-8

  • eBook Packages: Computer Science, Computer Science (R0)
