ABSTRACT
Automatically extracting semantic content from audio streams can benefit many multimedia applications. Motivated by the known limitations of traditional supervised approaches to content extraction, which generalize poorly and require suitable training data, we propose in this paper an unsupervised approach to discovering and categorizing semantic content in a composite audio stream. In our approach, we first employ spectral clustering to discover natural semantic sound clusters in the analyzed data stream (e.g., speech, music, noise, applause, speech mixed with music). These clusters are referred to as audio elements. From the obtained set of audio elements, the key audio elements, which are most prominent in characterizing the content of the input audio data, are selected and used to detect potential boundaries of semantic audio segments denoted as auditory scenes. Finally, the auditory scenes are categorized in terms of the audio elements appearing therein; the categorization is inferred from the relations between audio elements and auditory scenes using the information-theoretic co-clustering scheme. Evaluation of the proposed approach on 4 hours of diverse audio data shows promising results for both audio element discovery and auditory scene categorization.
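The first stage described above, grouping audio feature vectors into audio elements by spectral clustering, can be sketched as follows. This is a minimal illustration in the Ng-Jordan-Weiss style using NumPy/SciPy; the `spectral_cluster` helper, the Gaussian affinity with bandwidth `sigma`, and the generic per-segment feature vectors are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_cluster(features, k, sigma=1.0, seed=0):
    """Cluster feature vectors (rows of `features`) into k groups via
    normalized spectral clustering (a sketch; parameters are assumed)."""
    # Pairwise squared distances -> Gaussian affinity matrix.
    sq = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Symmetrically normalized affinity: D^{-1/2} A D^{-1/2}.
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # The top-k eigenvectors span the spectral embedding space.
    _, vecs = np.linalg.eigh(L)
    X = vecs[:, -k:]
    # Row-normalize the embedding, then run k-means in that space.
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    _, labels = kmeans2(X, k, minit='++', seed=seed)
    return labels

# Usage on synthetic "feature vectors": two well-separated groups.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 0.1, size=(20, 4)),
                   rng.normal(5.0, 0.1, size=(20, 4))])
labels = spectral_cluster(feats, k=2)
```

In practice the number of clusters k is unknown and must itself be estimated from the data (e.g., from the eigenvalue spectrum), which is part of what makes the unsupervised setting harder than supervised classification.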
Index Terms
- Unsupervised content discovery in composite audio