Audio and Video Feature Fusion for Activity Recognition in Unconstrained Videos

  • Conference paper
Intelligent Data Engineering and Automated Learning – IDEAL 2006 (IDEAL 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4224)

Abstract

Combining audio and image processing to understand video content has several benefits over using either modality on its own. For context and activity recognition in video sequences, it is important to explore both data streams to gather relevant information. In this paper we describe a video context and activity recognition model. Our approach extracts a range of audio and visual features, then applies feature reduction and information fusion. We show that combining audio-based and video-based decision making improves the quality of context and activity recognition in videos by 4% over audio data alone and 18% over image data alone.
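The abstract does not say which fusion rule the model uses. As a purely illustrative sketch of decision-level fusion, the snippet below assumes each modality's classifier outputs a vector of per-class probabilities and combines them with a weighted sum rule. The function names, the `audio_weight` parameter, the class labels, and the probability values are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def fuse_sum_rule(audio_probs: Sequence[float],
                  video_probs: Sequence[float],
                  audio_weight: float = 0.5) -> List[float]:
    """Combine two per-class probability vectors with a weighted sum rule.

    Assumption: both classifiers score the same classes in the same order.
    """
    if len(audio_probs) != len(video_probs):
        raise ValueError("both modalities must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")

    video_weight = 1.0 - audio_weight
    fused = [audio_weight * a + video_weight * v
             for a, v in zip(audio_probs, video_probs)]

    # Renormalise so the fused scores sum to 1 again.
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused scores must have a positive sum")
    return [score / total for score in fused]


def predict(labels: Sequence[str], probs: Sequence[float]) -> str:
    """Return the label with the highest fused score."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best_index]


# Hypothetical labels and classifier outputs, for illustration only.
labels = ["indoor", "outdoor", "vehicle"]
audio = [0.5, 0.3, 0.2]
video = [0.2, 0.5, 0.3]

fused = fuse_sum_rule(audio, video)
decision = predict(labels, fused)
```

Setting `audio_weight` to 1.0 or 0.0 reduces the sketch to the audio-only or video-only decision, which gives a simple baseline for comparing the fused result against each single modality.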

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lopes, J., Singh, S. (2006). Audio and Video Feature Fusion for Activity Recognition in Unconstrained Videos. In: Corchado, E., Yin, H., Botti, V., Fyfe, C. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2006. IDEAL 2006. Lecture Notes in Computer Science, vol 4224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11875581_99

  • DOI: https://doi.org/10.1007/11875581_99

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45485-4

  • Online ISBN: 978-3-540-45487-8

  • eBook Packages: Computer Science, Computer Science (R0)
