Abstract
In this paper we study the automatic recognition of sound categories (such as fridge, mixer or sawing sounds) from their vocal imitations. Vocal imitations consist of a succession over time of sounds produced using vocal mechanisms that can differ largely from the ones used in speech. We develop a recognition approach inspired by automatic speech recognition systems, with an acoustic model (which maps the audio signal to a set of probabilities over “phonemes”) and a language model (which represents the expected succession of “phonemes” for each sound category). Since the underlying “phonemes” of vocal imitations are unknown, we propose to estimate them automatically using Shift-Invariant Probabilistic Latent Component Analysis (SI-PLCA) applied to a dataset of vocal imitations. The kernel distributions of the SI-PLCA are considered as the “phonemes” of vocal imitations, and its impulse distributions are used to compute the emission probabilities of the states of a set of Hidden Markov Models (HMMs). To evaluate our proposal, we test it on the task of automatically recognizing 12 sound categories from their vocal imitations.
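The recognition step described in the abstract (one HMM per sound category, with emission probabilities derived from the SI-PLCA impulse distributions) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the model layout (`init`, `trans`, `state_kernel_probs`), the rule that scores a frame against a state by the inner product of the frame's normalized kernel activations with the state's distribution over kernels, and the toy two-kernel data. Only the overall scheme (forward-algorithm likelihood per category, then pick the best category) follows the abstract.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    peak = max(values)
    if peak == NEG_INF:
        return NEG_INF
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def emission_log_probs(activations, state_kernel_probs):
    """Hypothetical emission rule: log of the inner product between a frame's
    kernel activations and each state's distribution over kernels."""
    return [
        [safe_log(sum(a * b for a, b in zip(frame, state)))
         for state in state_kernel_probs]
        for frame in activations
    ]


def forward_log_likelihood(model, activations):
    """Log-likelihood of an activation sequence under one HMM (forward pass)."""
    log_emit = emission_log_probs(activations, model["state_kernel_probs"])
    n_states = len(model["init"])
    alpha = [safe_log(model["init"][s]) + log_emit[0][s] for s in range(n_states)]
    for t in range(1, len(log_emit)):
        alpha = [
            log_sum_exp([alpha[r] + safe_log(model["trans"][r][s])
                         for r in range(n_states)]) + log_emit[t][s]
            for s in range(n_states)
        ]
    return log_sum_exp(alpha)


def classify(models, activations):
    """Return the category whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda name: forward_log_likelihood(models[name], activations))


# Toy left-to-right HMMs over two kernels ("phonemes"); values are illustrative.
LEFT_TO_RIGHT = [[0.5, 0.5], [0.0, 1.0]]
MODELS = {
    "A": {"init": [1.0, 0.0], "trans": LEFT_TO_RIGHT,
          "state_kernel_probs": [[0.9, 0.1], [0.1, 0.9]]},
    "B": {"init": [1.0, 0.0], "trans": LEFT_TO_RIGHT,
          "state_kernel_probs": [[0.1, 0.9], [0.9, 0.1]]},
}
KERNEL_0_THEN_1 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
KERNEL_1_THEN_0 = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
```

With these toy models, `classify(MODELS, KERNEL_0_THEN_1)` selects category "A" and `classify(MODELS, KERNEL_1_THEN_0)` selects "B", since each HMM expects its kernels in a different order. The log-domain forward pass is used so that long sequences do not underflow.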
Notes
- 1.
- 2. In our CQT, harmonics are spaced by 18 bins, so the kernel size in frequency has to be at least 18 to exploit the shift-invariance.
- 3. One third of the data is used for testing and the remainder for training; each third is used in turn for testing.
- 4. The same subject cannot appear in both the training and the testing set.
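The evaluation protocol in notes 3 and 4 (three folds, each used in turn for testing, with no subject shared between training and testing) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names and the greedy balancing rule are assumptions, and because whole subjects are assigned to a fold, each fold holds only approximately one third of the data.

```python
from collections import defaultdict


def subject_disjoint_folds(subject_ids, n_folds=3):
    """Assign every subject to exactly one fold.

    Returns a list of n_folds lists of sample indices. Subjects with the most
    samples are placed first, each into the currently smallest fold, so that
    fold sizes stay roughly balanced (a hypothetical balancing choice).
    """
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)

    folds = [[] for _ in range(n_folds)]
    ordered = sorted(by_subject, key=lambda s: (-len(by_subject[s]), str(s)))
    for subject in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_subject[subject])
    return folds


def train_test_splits(subject_ids, n_folds=3):
    """Yield (train_indices, test_indices), using each fold in turn for testing."""
    folds = subject_disjoint_folds(subject_ids, n_folds)
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# Toy example: six imitations recorded by three subjects.
SUBJECTS = ["s1", "s1", "s2", "s2", "s3", "s3"]
SPLITS = list(train_test_splits(SUBJECTS))
```

Each element of `SPLITS` is a pair of index lists; the subjects behind the training indices never overlap with those behind the test indices, and every sample is tested exactly once across the three splits.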
Acknowledgments
This work was supported by the 7th FP of the EU (FP7-ICT-2013-C FET-Future Emerging Technologies) under grant agreement 618067 (SkAT-VG project).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Marchetto, E., Peeters, G. (2018). Automatic Recognition of Sound Categories from Their Vocal Imitation Using Audio Primitives Automatically Found by SI-PLCA and HMM. In: Aramaki, M., Davies, M., Kronland-Martinet, R., Ystad, S. (eds) Music Technology with Swing. CMMR 2017. Lecture Notes in Computer Science, vol 11265. Springer, Cham. https://doi.org/10.1007/978-3-030-01692-0_1
DOI: https://doi.org/10.1007/978-3-030-01692-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01691-3
Online ISBN: 978-3-030-01692-0
eBook Packages: Computer Science, Computer Science (R0)