DOI: 10.1145/1101149.1101292
Article

Unsupervised content discovery in composite audio

Published: 06 November 2005

ABSTRACT

Automatically extracting semantic content from audio streams can be helpful in many multimedia applications. Motivated by the known limitations of traditional supervised approaches to content extraction, which are hard to generalize and require suitable training data, we propose in this paper an unsupervised approach to discover and categorize semantic content in a composite audio stream. In our approach, we first employ spectral clustering to discover natural semantic sound clusters in the analyzed data stream (e.g. speech, music, noise, applause, speech mixed with music, etc.). These clusters are referred to as audio elements. Based on the obtained set of audio elements, the key audio elements, which are most prominent in characterizing the content of input audio data, are selected and used to detect potential boundaries of semantic audio segments denoted as auditory scenes. Finally, the auditory scenes are categorized in terms of the audio elements appearing therein. Categorization is inferred from the relations between audio elements and auditory scenes by using the information-theoretic co-clustering scheme. Evaluations of the proposed approach performed on 4 hours of diverse audio data indicate that promising results can be achieved, both regarding audio element discovery and auditory scene categorization.
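The first step of the approach, discovering audio elements by spectral clustering, can be illustrated with a minimal sketch. This is a generic normalized spectral clustering routine, not the authors' implementation: the abstract does not specify the audio features, the affinity function, or how the number of clusters is chosen, so the Gaussian affinity, the fixed `sigma`, the fixed `k`, and the function name `spectral_clusters` are all assumptions made for illustration. Each row of `features` stands in for the feature vector of one short audio segment.

```python
import numpy as np


def spectral_clusters(features, k, sigma=1.0, n_iter=50):
    """Group the rows of `features` into k clusters (illustrative sketch)."""
    # Gaussian affinity between every pair of feature vectors, no self-loops.
    # The affinity form and sigma are assumptions, not taken from the paper.
    sq_dists = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    affinity = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)

    # Symmetrically normalized affinity D^{-1/2} A D^{-1/2}.
    inv_sqrt_deg = 1.0 / np.sqrt(affinity.sum(axis=1) + 1e-12)
    normalized = affinity * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]

    # Embed each point with the k eigenvectors of largest eigenvalue,
    # then scale every embedded row to unit length.
    _, eigvecs = np.linalg.eigh(normalized)  # eigenvalues come back ascending
    embedding = eigvecs[:, -k:]
    embedding = embedding / (np.linalg.norm(embedding, axis=1, keepdims=True) + 1e-12)

    # Plain k-means in the embedded space, seeded deterministically by
    # repeatedly picking the point farthest from the centers chosen so far.
    centers = [embedding[0]]
    for _ in range(1, k):
        gaps = np.min([np.sum((embedding - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(embedding[np.argmax(gaps)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.sum((embedding[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = embedding[labels == j].mean(axis=0)
    return labels
```

Calling `spectral_clusters(features, k=2)` on an array with one row per segment returns one integer label per row; segments that share a label would play the role of one candidate audio element. In practice the number of clusters and the affinity scale would have to be estimated from the data rather than fixed as they are here.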

Published in

MULTIMEDIA '05: Proceedings of the 13th annual ACM international conference on Multimedia
November 2005
1110 pages
ISBN: 1595930442
DOI: 10.1145/1101149

Copyright © 2005 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance rates: MULTIMEDIA '05 paper acceptance rate, 49 of 312 submissions (16%); overall acceptance rate, 995 of 4,171 submissions (24%).
