ABSTRACT
Automatically extracting semantic content from audio streams can benefit many multimedia applications. Motivated by the known limitations of traditional supervised approaches to content extraction, which generalize poorly and require suitable training data, we propose in this paper an unsupervised approach to discovering and categorizing semantic content in a composite audio stream. In our approach, we first employ spectral clustering to discover natural semantic sound clusters in the analyzed data stream (e.g., speech, music, noise, applause, speech mixed with music). These clusters are referred to as audio elements. From the obtained set of audio elements, the key audio elements, which are most prominent in characterizing the content of the input audio data, are selected and used to detect potential boundaries of semantic audio segments denoted as auditory scenes. Finally, the auditory scenes are categorized in terms of the audio elements appearing therein; the categorization is inferred from the relations between audio elements and auditory scenes using the information-theoretic co-clustering scheme. Evaluation of the proposed approach on 4 hours of diverse audio data shows promising results for both audio element discovery and auditory scene categorization.
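The first stage described above, grouping audio feature vectors into audio elements by spectral clustering, can be sketched as follows. This is a minimal illustration in the Ng-Jordan-Weiss style using NumPy/SciPy; the `spectral_cluster` helper, the Gaussian affinity with bandwidth `sigma`, and the generic per-segment feature vectors are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_cluster(features, k, sigma=1.0, seed=0):
    """Cluster feature vectors (rows of `features`) into k groups via
    normalized spectral clustering (a sketch; parameters are assumed)."""
    # Pairwise squared distances -> Gaussian affinity matrix.
    sq = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Symmetrically normalized affinity: D^{-1/2} A D^{-1/2}.
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # The top-k eigenvectors span the spectral embedding space.
    _, vecs = np.linalg.eigh(L)
    X = vecs[:, -k:]
    # Row-normalize the embedding, then run k-means in that space.
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    _, labels = kmeans2(X, k, minit='++', seed=seed)
    return labels

# Usage on synthetic "feature vectors": two well-separated groups.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 0.1, size=(20, 4)),
                   rng.normal(5.0, 0.1, size=(20, 4))])
labels = spectral_cluster(feats, k=2)
```

In practice the number of clusters k is unknown and must itself be estimated from the data (e.g., from the eigenvalue spectrum), which is part of what makes the unsupervised setting harder than supervised classification.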
Index Terms
- Unsupervised content discovery in composite audio