Abstract
Audio concept detection is essentially a multi-label classification problem, yet it is normally solved by treating each concept independently, which discards useful information about concept correlation. This paper proposes a new model, the Correlated-Aspect Gaussian Mixture Model (C-AGMM), that exploits this cue to enhance multi-label audio concept detection. Like the Aspect Gaussian Mixture Model (AGMM), which improves on the GMM by incorporating it into probabilistic Latent Semantic Analysis (pLSA), C-AGMM learns a probabilistic model of the whole audio clip with concepts as its component elements. Unlike AGMM, however, which assumes the concepts are independent of one another, C-AGMM assumes they are distributed on a sub-manifold embedded in the ambient space. Under the assumption that concepts close to each other in the intrinsic geometry of this distribution have similar conditional probability distributions, a graph regularizer is used to model the correlation between concepts. Following the Maximum Likelihood Estimation principle, the model parameters of C-AGMM, which encode the concept correlation cue, are derived and used directly as the detection criterion. Experiments on two datasets show the effectiveness of the proposed model.
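To make the role of the graph regularizer concrete, the following is a minimal sketch of the kind of objective such a formulation suggests; the quadratic smoothness term, the concept-affinity weights \(W_{kl}\), and the trade-off parameter \(\lambda\) are illustrative assumptions rather than the paper's exact derivation:
\[
\mathcal{L}_{\text{reg}}
= \underbrace{\sum_{n=1}^{N} \log \sum_{k=1}^{K} P(c_k \mid d)\, p(x_n \mid c_k)}_{\text{AGMM log-likelihood of clip } d}
\;-\; \frac{\lambda}{2} \underbrace{\sum_{k,l=1}^{K} W_{kl}\, \bigl\lVert P(\cdot \mid c_k) - P(\cdot \mid c_l) \bigr\rVert^{2}}_{\text{graph regularizer over concepts}},
\]
where \(x_1,\dots,x_N\) are the feature vectors of clip \(d\), \(c_1,\dots,c_K\) are the concepts, and \(W_{kl}\) is large when concepts \(c_k\) and \(c_l\) are close on a neighborhood graph approximating the concept sub-manifold. Maximizing an objective of this form (e.g., by a penalized EM procedure) yields parameters that fit the audio data while keeping the conditional distributions of correlated concepts similar.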
Acknowledgments
This work is supported by the National Science Foundation of China (61273274, 4123104), National 973 Key Research Program of China (2011CB302203), Ph.D. Programs Foundation of Ministry of Education of China (20100009110004), National Key Technology R&D Program of China (2012BAH01F03) and Tsinghua-Tencent Joint Lab for IIT.
Cite this article
Zhong, C., Miao, Z. Multi-label audio concept detection using correlated-aspect Gaussian Mixture Model. Multimed Tools Appl 74, 4817–4832 (2015). https://doi.org/10.1007/s11042-013-1842-9