Toward semantic indexing and retrieval using hierarchical audio models

  • Regular Paper
Multimedia Systems

Abstract

Semantic-level content analysis is crucial to efficient content retrieval and management. We propose a hierarchical approach that models the statistical characteristics of audio events over a time series to detect semantic contexts. Two stages, audio event modeling and semantic context modeling, are devised to bridge the semantic gap between physical audio features and semantic concepts. In this work, hidden Markov models (HMMs) are used to model four representative audio events in action movies: gunshot, explosion, engine, and car-braking. At the semantic-context level, Gaussian mixture models (GMMs) and ergodic HMMs are investigated as ways to fuse the characteristics of, and correlations between, the audio events. The fused models provide cues for detecting gunplay and car-chasing scenes, the two semantic contexts we focus on in this work. Experimental results demonstrate the effectiveness of the proposed approach and show that the framework provides a foundation for semantic indexing and retrieval. Moreover, the two fusion schemes are compared, and the relations between audio events and semantic contexts are studied.
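
To make the two-stage idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that audio features have been quantized into two symbols (low and high energy), that each audio event has a small discrete-observation HMM, and that each semantic context has a diagonal-covariance GMM over the vector of per-event scores. Every name and parameter value here (`toy_hmm`, `EVENT_HMMS`, `CONTEXT_GMMS`, the probabilities, means, and variances) is hypothetical; the abstract does not specify features, model sizes, or parameters. The ergodic-HMM fusion scheme is not sketched.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def hmm_log_likelihood(obs, log_pi, log_A, log_B):
    """Forward algorithm in the log domain for a discrete-observation HMM."""
    n = len(log_pi)
    alpha = [log_pi[i] + log_B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            logsumexp([alpha[i] + log_A[i][j] for i in range(n)]) + log_B[j][o]
            for j in range(n)
        ]
    return logsumexp(alpha)

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of vector x under a diagonal-covariance GMM."""
    comps = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        comps.append(ll)
    return logsumexp(comps)

def toy_hmm(p_high):
    """Hypothetical 2-state HMM over symbols {0: low energy, 1: high energy}."""
    log = math.log
    log_pi = [log(0.5), log(0.5)]
    log_A = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
    log_B = [[log(1 - p_high), log(p_high)], [log(0.5), log(0.5)]]
    return log_pi, log_A, log_B

# Stage 1 (assumed parameters): one HMM per audio event named in the abstract.
EVENT_ORDER = ["gunshot", "explosion", "engine", "car-braking"]
EVENT_HMMS = {
    "gunshot": toy_hmm(0.9),
    "explosion": toy_hmm(0.8),
    "engine": toy_hmm(0.3),
    "car-braking": toy_hmm(0.2),
}

def event_scores(obs):
    """Per-frame log-likelihood of one segment under each event HMM."""
    return [hmm_log_likelihood(obs, *EVENT_HMMS[e]) / len(obs) for e in EVENT_ORDER]

# Stage 2 (assumed parameters): one GMM per semantic context over score vectors.
CONTEXT_GMMS = {
    "gunplay": ([0.5, 0.5],
                [[-0.5, -0.5, -2.0, -2.0], [-0.7, -0.7, -1.5, -1.5]],
                [[0.1] * 4, [0.1] * 4]),
    "car-chasing": ([0.5, 0.5],
                    [[-2.0, -2.0, -0.5, -0.5], [-1.5, -1.5, -0.7, -0.7]],
                    [[0.1] * 4, [0.1] * 4]),
}

def detect_context(scores):
    """Return the context whose GMM gives the score vector the highest likelihood."""
    return max(CONTEXT_GMMS, key=lambda c: gmm_log_likelihood(scores, *CONTEXT_GMMS[c]))

if __name__ == "__main__":
    segment = [1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical quantized frames
    scores = event_scores(segment)
    print(dict(zip(EVENT_ORDER, scores)))
    print(detect_context(scores))
```

The design choice illustrated is the hierarchy itself: the second stage never sees raw audio features, only how strongly each event model responds, so the context model can capture which events co-occur in a scene.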

Author information

Correspondence to Wei-Ta Chu.

About this article

Cite this article

Chu, WT., Cheng, WH., Hsu, J.YJ. et al. Toward semantic indexing and retrieval using hierarchical audio models. Multimedia Systems 10, 570–583 (2005). https://doi.org/10.1007/s00530-005-0183-6

  • DOI: https://doi.org/10.1007/s00530-005-0183-6
