Abstract
Semantic-level content analysis is crucial for efficient content retrieval and management. We propose a hierarchical approach that models the statistical characteristics of audio events over time to detect semantic contexts. Two stages, audio event modeling and semantic context modeling, are devised to bridge the semantic gap between physical audio features and semantic concepts. Hidden Markov models (HMMs) are used to model four representative audio events in action movies: gunshot, explosion, engine, and car-braking. At the semantic-context level, Gaussian mixture models (GMMs) and ergodic HMMs are investigated to fuse the characteristics of, and correlations among, these audio events. The fused models provide cues for detecting gunplay and car-chasing scenes, the two semantic contexts this work focuses on. The experimental results are promising: they demonstrate the effectiveness of the proposed approach and indicate that the framework provides a foundation for semantic indexing and retrieval. The two fusion schemes are also compared, and the relations between audio events and semantic contexts are studied.
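The second, semantic-context stage can be illustrated with a minimal sketch of GMM-based fusion. It is not the authors' implementation. It assumes that the first stage (the audio-event HMMs) has already produced one confidence score per audio event for a segment, and it scores that four-dimensional vector under one diagonal-covariance GMM per semantic context. All names (`EVENTS`, `CONTEXT_GMMS`, `detect_context`) and all numeric parameters are hypothetical; in practice the mixtures would be trained, for example with the EM algorithm.

```python
import math

# Order of the audio-event scores in each feature vector (events named in the abstract).
EVENTS = ("gunshot", "explosion", "engine", "car-braking")


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_loglik(x, components):
    """Log-likelihood of x under a GMM given as (weight, mean, var) tuples."""
    logs = [math.log(w) + diag_gaussian_logpdf(x, m, v) for w, m, v in components]
    peak = max(logs)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


# Hypothetical, hand-set mixtures (assumption: real ones would be learned from data).
CONTEXT_GMMS = {
    "gunplay": [
        (0.6, (0.8, 0.3, 0.1, 0.1), (0.02, 0.05, 0.02, 0.02)),
        (0.4, (0.5, 0.7, 0.2, 0.1), (0.04, 0.04, 0.03, 0.02)),
    ],
    "car-chasing": [
        (0.5, (0.1, 0.1, 0.8, 0.4), (0.02, 0.02, 0.03, 0.05)),
        (0.5, (0.1, 0.2, 0.6, 0.8), (0.02, 0.03, 0.04, 0.03)),
    ],
}


def detect_context(event_scores):
    """Return the most likely context and the per-context log-likelihoods."""
    x = tuple(event_scores[name] for name in EVENTS)
    logliks = {ctx: gmm_loglik(x, comps) for ctx, comps in CONTEXT_GMMS.items()}
    return max(logliks, key=logliks.get), logliks


# Assumed first-stage output for one segment: strong gunshot, some explosion.
segment = {"gunshot": 0.75, "explosion": 0.4, "engine": 0.15, "car-braking": 0.05}
label, scores = detect_context(segment)
print(label, scores)
```

Picking the context with the highest likelihood is the simplest decision rule; a detector would more likely threshold each context's likelihood separately, since a segment may belong to neither context. The ergodic-HMM fusion scheme mentioned above would additionally model how the event scores evolve across consecutive segments, which this per-segment sketch ignores.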
Cite this article
Chu, WT., Cheng, WH., Hsu, J.YJ. et al. Toward semantic indexing and retrieval using hierarchical audio models. Multimedia Systems 10, 570–583 (2005). https://doi.org/10.1007/s00530-005-0183-6
DOI: https://doi.org/10.1007/s00530-005-0183-6