ABSTRACT
Advancements in and easier access to technology have led to greater demand for applications whose interaction is performed through voice recognition, and multimedia content has become a valuable source for computational analysis. Vocal representations are extracted for various purposes in areas such as convenience, accessibility, security, and sentiment analysis. The main challenge of speech recognition lies in the variability of speakers, environments, and devices, and in the presence of disfluencies in spontaneous speech. These aspects affect transcription tools, which are essential when the user interacts through voice and expects text to be produced from that interaction. In particular, the detection of disfluencies can help identify aspects related to the emotional state of the speaker. This work presents an analysis of text transcription tools, with a focus on disfluency detection, covering the metrics most used for evaluation and the databases used in evaluations in the context of Brazilian Portuguese. An experiment was conducted to evaluate the performance of three tools (IBM Watson, Google Speech, and Vosk). Google Speech achieved the best performance, with an average Word Error Rate of 9.69% for fluent sentences and 17.15% for disfluent sentences, followed by IBM Watson with 11.86% and 23.44%, and Vosk with 14.39% and 22.56%, respectively.
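The Word Error Rate reported above is the standard transcription metric: the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch of this computation (the function name and example sentences are illustrative, not taken from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: Levenshtein distance over word tokens,
    normalized by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of five reference words gives WER = 0.20 (20%)
print(wer("o gato subiu no telhado", "o gato subiu telhado"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is one reason disfluent speech (with fillers and repetitions absent from the reference) scores markedly worse in the experiment above.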