
Intelligent Audio Signal Processing – Do We Still Need Annotated Datasets?

  • Conference paper
Intelligent Information and Database Systems (ACIIDS 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13758)


Abstract

In this paper, examples of intelligent audio signal processing are briefly described. The focus, however, is on the machine learning approach and the datasets it requires, especially for deep learning models. Years of intense research have produced many important results in this area; however, the goal of fully intelligent signal processing, characterized by autonomous operation, has not yet been achieved. Therefore, a review of the state of the art in this area is given, with emphasis on the importance of acquiring an appropriate dataset of audio samples dedicated to the task. The paper starts with examples of audio-related datasets returned by a search engine inquiry. Then, research studies are discussed along with their results, and several works carried out by the author and her collaborators are presented. The paper closes with thoughts on future work and an answer to the question of whether annotated datasets are still needed.
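To make the annotation question concrete, the supervised pipeline the abstract alludes to pairs each audio clip with a human-provided label before any model can be fit. The minimal Python sketch below is not taken from the paper; the annotations.csv layout, the file names, and the choice of a random-forest classifier over summarized log-mel features are illustrative assumptions. It only shows where the labeling effort enters.

    # A minimal sketch (not from the paper) of the supervised pipeline:
    # annotated audio clips -> spectrogram features -> classifier.
    # File paths and the metadata CSV layout are hypothetical placeholders.
    import numpy as np
    import pandas as pd
    import librosa
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Hypothetical annotation file: one row per clip, columns "filename" and "label".
    meta = pd.read_csv("annotations.csv")

    def clip_features(path, sr=22050, n_mels=64):
        """Load a clip and summarize its log-mel spectrogram into a fixed-size vector."""
        y, sr = librosa.load(path, sr=sr, mono=True)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel)
        # Mean and std over time keep the vector length independent of clip duration.
        return np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])

    X = np.stack([clip_features(f) for f in meta["filename"]])
    y = meta["label"].to_numpy()

    # The human-provided labels `y` are exactly the annotation cost in question:
    # without them, this supervised training step is impossible.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))

Semi-supervised and self-supervised methods aim to shrink or remove the labeled set that this sketch depends on, which is the trade-off the paper examines.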



Author information


Corresponding author

Correspondence to Bozena Kostek.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kostek, B. (2022). Intelligent Audio Signal Processing – Do We Still Need Annotated Datasets? In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, T.P., Trawiński, B., Szczerbicki, E. (eds.) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol. 13758. Springer, Cham. https://doi.org/10.1007/978-3-031-21967-2_55


  • DOI: https://doi.org/10.1007/978-3-031-21967-2_55


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21966-5

  • Online ISBN: 978-3-031-21967-2

  • eBook Packages: Computer Science, Computer Science (R0)
