
Intelligent Audio Signal Processing – Do We Still Need Annotated Datasets?

  • Conference paper
Intelligent Information and Database Systems (ACIIDS 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13758)


Abstract

In this paper, examples of intelligent audio signal processing are briefly described. The focus, however, is on the machine learning approach and the datasets it requires, especially for deep learning models. Years of intense research have produced many important results in this area; however, the goal of fully intelligent signal processing, characterized by autonomous operation, has not yet been achieved. Therefore, a review of the state of the art in this area is given, with emphasis on the importance of acquiring an appropriate dataset of audio samples dedicated to the task. The paper starts with examples of audio-related datasets returned by a search engine inquiry. Then, research studies are discussed along with their results, and several works carried out by the author and her collaborators are presented. The paper closes with thoughts on future work and an answer to the question of whether annotated datasets are still needed.
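To make the annotation question concrete, the supervised pipeline the abstract alludes to pairs each audio clip with a human-provided label before any model can be fit. The minimal Python sketch below is not taken from the paper; the annotations.csv layout, the file names, and the choice of a random-forest classifier over summarized log-mel features are illustrative assumptions. It only shows where the labeling effort enters.

    # A minimal sketch (not from the paper) of the supervised pipeline:
    # annotated audio clips -> spectrogram features -> classifier.
    # File paths and the metadata CSV layout are hypothetical placeholders.
    import numpy as np
    import pandas as pd
    import librosa
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Hypothetical annotation file: one row per clip, columns "filename" and "label".
    meta = pd.read_csv("annotations.csv")

    def clip_features(path, sr=22050, n_mels=64):
        """Load a clip and summarize its log-mel spectrogram into a fixed-size vector."""
        y, sr = librosa.load(path, sr=sr, mono=True)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel)
        # Mean and std over time keep the vector length independent of clip duration.
        return np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])

    X = np.stack([clip_features(f) for f in meta["filename"]])
    y = meta["label"].to_numpy()

    # The human-provided labels `y` are exactly the annotation cost in question:
    # without them, this supervised training step is impossible.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))

Semi-supervised and self-supervised methods aim to shrink or remove the labeled set that this sketch depends on, which is the trade-off the paper examines.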



Author information


Corresponding author

Correspondence to Bozena Kostek.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kostek, B. (2022). Intelligent Audio Signal Processing – Do We Still Need Annotated Datasets? In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, T.P., Trawiński, B., Szczerbicki, E. (eds.) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol. 13758. Springer, Cham. https://doi.org/10.1007/978-3-031-21967-2_55


  • DOI: https://doi.org/10.1007/978-3-031-21967-2_55


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21966-5

  • Online ISBN: 978-3-031-21967-2

  • eBook Packages: Computer Science, Computer Science (R0)
