MakedonASRDataset - A Dataset for Speech Recognition in the Macedonian Language

Mishev, Martin; Penkova, Blagica; Mitreska, Maja; Kostoska, Magdalena; Todorovska, Ana; Simjanoska, Monika; Mishev, Kostadin

doi:10.1007/978-3-031-54321-0_2

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1991)

Included in the following conference series:

International Conference on ICT Innovations

Abstract

Dataset analysis is becoming an increasingly popular research method among researchers with diverse backgrounds in data collection and analysis. This paper presents the first publicly available dataset of audio segments paired with their textual transcriptions in the Macedonian language. The dataset is preprocessed and prepared for direct use in automatic speech recognition pipelines. It was created by students at the Faculty of Computer Science and Engineering as part of the elective course ‘Digital Libraries’, with the audio segments sourced from a YouTube channel.

Supported by Faculty of Computer Science and Engineering, Skopje, N. Macedonia.

M. Mishev, B. Penkova, M. Mitreska, M. Kostoska, A. Todorovska, M. Simjanoska, and K. Mishev: equal contribution.

Notes

1. MakedonASRDataset: https://drive.google.com/file/d/1ecgz27gzUTzgwu7Lof6l_cM7PQrN2cAV/view?usp=drive_link
2. https://www.audacityteam.org/

Acknowledgements

This work was partially financed by the Faculty of Computer Science and Engineering at the Ss. Cyril and Methodius University in Skopje.

Author information

Authors and Affiliations

Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
Martin Mishev, Blagica Penkova, Maja Mitreska, Magdalena Kostoska, Ana Todorovska, Monika Simjanoska & Kostadin Mishev

Corresponding author

Correspondence to Martin Mishev.

Editor information

Editors and Affiliations

Ss. Cyril and Methodius University in Skopje, Skopje, North Macedonia
Marija Mihova
Ss. Cyril and Methodius University in Skopje, Skopje, North Macedonia
Mile Jovanov

About this paper

Cite this paper

Mishev, M. et al. (2024). MakedonASRDataset - A Dataset for Speech Recognition in the Macedonian Language. In: Mihova, M., Jovanov, M. (eds) ICT Innovations 2023. Learning: Humans, Theory, Machines, and Data. ICT Innovations 2023. Communications in Computer and Information Science, vol 1991. Springer, Cham. https://doi.org/10.1007/978-3-031-54321-0_2

DOI: https://doi.org/10.1007/978-3-031-54321-0_2
Published: 27 February 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54320-3
Online ISBN: 978-3-031-54321-0
eBook Packages: Computer Science, Computer Science (R0)
