Toward semantic indexing and retrieval using hierarchical audio models

Chu, Wei-Ta; Cheng, Wen-Huang; Hsu, Jane Yung-Jen; Wu, Ja-Ling

doi:10.1007/s00530-005-0183-6

Regular Paper
Published: 10 May 2005

Multimedia Systems, Volume 10, pages 570–583 (2005)

Wei-Ta Chu¹, Wen-Huang Cheng², Jane Yung-Jen Hsu³ & Ja-Ling Wu³

Abstract

Semantic-level content analysis is crucial for efficient content retrieval and management. We propose a hierarchical approach that models the statistical characteristics of audio events over a time series to detect semantic context. Two stages, audio event modeling and semantic context modeling, are devised to bridge the semantic gap between physical audio features and semantic concepts. In this work, hidden Markov models (HMMs) are used to model four representative audio events in action movies: gunshot, explosion, engine, and car-braking. At the semantic-context level, Gaussian mixture models (GMMs) and ergodic HMMs are investigated to fuse the characteristics of, and correlations between, the audio events. They provide cues for detecting gunplay and car-chasing scenes, the two semantic contexts we focus on in this work. The experimental results are promising: they demonstrate the effectiveness of the proposed approach and show that the framework provides a foundation for semantic indexing and retrieval. Moreover, the two fusion schemes are compared, and the relations between audio events and semantic contexts are studied.
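The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the preview does not give the audio features, HMM topologies, confidence measure, or trained parameters, so every number below is hypothetical, the features are one-dimensional toy values, and the softmax used to turn HMM log-likelihoods into event confidences is an assumption. Only the GMM fusion variant is shown; the ergodic-HMM variant is omitted.

```python
import numpy as np

EVENTS = ["gunshot", "explosion", "engine", "car-braking"]

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian, summed over the last axis."""
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def hmm_log_likelihood(obs, log_pi, log_A, means, variances):
    """Forward algorithm in log space: log P(obs | HMM) with Gaussian emissions.

    obs: (T, D); log_pi: (S,); log_A: (S, S); means, variances: (S, D).
    """
    # log_b[t, s] = log p(obs[t] | state s)
    log_b = log_gauss(obs[:, None, :], means[None], variances[None])
    alpha = log_pi + log_b[0]
    for t in range(1, len(obs)):
        # alpha_t(j) = logsum_i(alpha_{t-1}(i) + log A[i, j]) + log b_j(o_t)
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_b[t]
    return np.logaddexp.reduce(alpha)

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | GMM) for a single vector x of shape (D,)."""
    return np.logaddexp.reduce(np.log(weights) + log_gauss(x, means, variances))

# Stage 1: one HMM per audio event (two states each; parameters are made up).
log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.7, 0.3], [0.3, 0.7]])
var = np.full((2, 1), 0.25)
event_means = {name: np.array([[k], [k + 0.5]]) for k, name in enumerate(EVENTS)}

segment = np.array([[0.1], [-0.2], [0.4], [0.6], [0.3]])  # toy 1-D feature frames
scores = np.array(
    [hmm_log_likelihood(segment, log_pi, log_A, event_means[e], var) for e in EVENTS]
)
# Assumption: a softmax over events turns log-likelihoods into confidences.
conf = np.exp(scores - np.logaddexp.reduce(scores))

# Stage 2: one GMM per semantic context over event-confidence vectors.
context_means = {
    "gunplay": np.array([[0.8, 0.2, 0.0, 0.0], [0.2, 0.8, 0.0, 0.0]]),
    "car-chasing": np.array([[0.0, 0.0, 0.8, 0.2], [0.0, 0.0, 0.2, 0.8]]),
}
weights = np.array([0.5, 0.5])
ctx_var = np.full((2, 4), 0.05)
ctx_scores = {
    name: gmm_log_likelihood(conf, weights, m, ctx_var)
    for name, m in context_means.items()
}
decision = max(ctx_scores, key=ctx_scores.get)
print(dict(zip(EVENTS, conf.round(3))), decision)
```

In this sketch the event HMMs act as detectors and the context GMMs see only their confidence vector, which is one simple way to keep the semantic-context stage independent of the low-level audio features.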

Author information

Authors and Affiliations

Department of Computer Science and Information Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, Taiwan, 106
Wei-Ta Chu
Graduate Institute of Networking and Multimedia, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, Taiwan, 106
Wen-Huang Cheng
Department of Computer Science and Information Engineering; Graduate Institute of Networking and Multimedia, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, Taiwan, 106
Jane Yung-Jen Hsu & Ja-Ling Wu

Corresponding author

Correspondence to Wei-Ta Chu.

About this article

Cite this article

Chu, WT., Cheng, WH., Hsu, J.YJ. et al. Toward semantic indexing and retrieval using hierarchical audio models. Multimedia Systems 10, 570–583 (2005). https://doi.org/10.1007/s00530-005-0183-6

Received: 16 April 2004
Revised: 20 November 2004
Published: 10 May 2005
Issue Date: October 2005
DOI: https://doi.org/10.1007/s00530-005-0183-6
