Automatic Recognition of Sound Categories from Their Vocal Imitation Using Audio Primitives Automatically Found by SI-PLCA and HMM

Marchetto, Enrico; Peeters, Geoffroy

doi:10.1007/978-3-030-01692-0_1

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11265)

Included in the following conference series:

International Symposium on Computer Music Multidisciplinary Research

Abstract

In this paper, we study the automatic recognition of sound categories (such as fridge, mixer, or sawing sounds) from their vocal imitations. A vocal imitation is a succession over time of sounds produced by vocal mechanisms that can differ substantially from those used in speech. We develop a recognition approach inspired by automatic speech recognition systems, with an acoustic model (which maps the audio signal to a set of probabilities over “phonemes”) and a language model (which represents the expected succession of “phonemes” for each sound category). Since the underlying “phonemes” of vocal imitations are unknown, we propose to estimate them automatically by applying Shift-Invariant Probabilistic Latent Component Analysis (SI-PLCA) to a dataset of vocal imitations. The kernel distributions of the SI-PLCA are taken as the “phonemes” of vocal imitation, and its impulse distributions are used to compute the emission probabilities of the states of a set of Hidden Markov Models (HMMs). We evaluate the proposal on the task of automatically recognizing 12 sound categories from their vocal imitations.
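
The recognition step described above (one HMM per sound category, with the highest-scoring model giving the label) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each frame has already been reduced to a normalized vector of activations over K learned kernels, and that each HMM state holds a discrete distribution over those kernels. The soft emission rule, the function names `log_likelihood` and `classify`, and the two toy models are all hypothetical and serve only as illustration.

```python
import math


def log_likelihood(frames, start, trans, emit):
    """Scaled forward algorithm for one HMM.

    frames: list of frames, each a list of K non-negative kernel weights summing to 1
    start:  list of S initial state probabilities
    trans:  S x S transition matrix, rows sum to 1
    emit:   S x K matrix, emit[s][k] = P(kernel k | state s)  (assumed form)
    """
    n_states = len(start)
    alpha = None
    total = 0.0
    for frame in frames:
        # Assumed soft emission: expected probability of the frame's kernel mixture.
        b = [sum(e * a for e, a in zip(emit[s], frame)) for s in range(n_states)]
        if alpha is None:
            alpha = [start[s] * b[s] for s in range(n_states)]
        else:
            alpha = [
                b[s] * sum(alpha[r] * trans[r][s] for r in range(n_states))
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence impossible under this model
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        total += math.log(scale)
    return total


def classify(frames, models):
    """Return the category whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(frames, *models[name]))


# Two toy left-to-right models over K = 2 kernels (hypothetical values).
start = [1.0, 0.0]
trans = [[0.7, 0.3], [0.0, 1.0]]
models = {
    "rising": (start, trans, [[0.9, 0.1], [0.1, 0.9]]),   # kernel 0, then kernel 1
    "falling": (start, trans, [[0.1, 0.9], [0.9, 0.1]]),  # kernel 1, then kernel 0
}
frames = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print(classify(frames, models))
```

Rescaling the forward variable at each frame and accumulating the logarithm of the scale keeps the computation stable for long imitations, where a plain product of probabilities would underflow.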

Notes

1. http://skatvg.iuav.it/.
2. In our CQT, harmonics are spaced by 18 bins, so the kernel size in frequency has to be at least 18 bins to exploit the shift-invariance.
3. One third of the data is used for testing and the remainder for training; each third is used in turn for testing.
4. The same subject cannot appear in both the training set and the testing set.
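
Notes 3 and 4 together describe a three-fold evaluation in which no subject is shared between training and testing. A minimal sketch of such a split follows. The function name `subject_folds` and the round-robin assignment of subjects to folds are assumptions made for illustration; the source does not say how subjects were distributed across the thirds, and the folds below hold roughly one third of the data only if subjects contribute similar numbers of recordings.

```python
def subject_folds(subjects, n_folds=3):
    """Yield (train_indices, test_indices) pairs with subject-disjoint folds.

    subjects: list giving the subject id of each recording.
    Each distinct subject is assigned to exactly one fold (round-robin over
    sorted ids, an assumed policy), and each fold is used once for testing.
    """
    unique = sorted(set(subjects))
    fold_of = {subj: i % n_folds for i, subj in enumerate(unique)}
    for fold in range(n_folds):
        test = [i for i, s in enumerate(subjects) if fold_of[s] == fold]
        train = [i for i, s in enumerate(subjects) if fold_of[s] != fold]
        yield train, test


# Hypothetical subject ids for nine recordings.
recording_subjects = ["s1", "s2", "s3", "s1", "s2", "s3", "s4", "s5", "s6"]
for train_idx, test_idx in subject_folds(recording_subjects):
    print(train_idx, test_idx)
```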

Acknowledgments

This work was supported by the 7th FP of the EU (FP7-ICT-2013-C FET-Future Emerging Technologies) under grant agreement 618067 (SkAT-VG project).

Author information

Authors and Affiliations

UMR STMS 9912 (IRCAM – CNRS – Sorbonne-University), Paris, France
Enrico Marchetto & Geoffroy Peeters

Corresponding author

Correspondence to Geoffroy Peeters.

Editor information

Editors and Affiliations

Laboratoire PRISM, AMU-CNRS, Marseille, France
Mitsuko Aramaki
INESC TEC, Porto, Portugal
Matthew E. P. Davies
Laboratoire PRISM, AMU-CNRS, Marseille, France
Richard Kronland-Martinet
Laboratoire PRISM, AMU-CNRS, Marseille, France
Sølvi Ystad

About this paper

Cite this paper

Marchetto, E., Peeters, G. (2018). Automatic Recognition of Sound Categories from Their Vocal Imitation Using Audio Primitives Automatically Found by SI-PLCA and HMM. In: Aramaki, M., Davies, M., Kronland-Martinet, R., Ystad, S. (eds) Music Technology with Swing. CMMR 2017. Lecture Notes in Computer Science, vol 11265. Springer, Cham. https://doi.org/10.1007/978-3-030-01692-0_1

DOI: https://doi.org/10.1007/978-3-030-01692-0_1
Published: 24 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01691-3
Online ISBN: 978-3-030-01692-0
eBook Packages: Computer Science, Computer Science (R0)
