DOI: 10.1145/3554364.3559112
research-article

Analysis of transcription tools for Brazilian Portuguese with focus on disfluency detection

Published: 19 October 2022

ABSTRACT

Advances in technology, and easier access to it, have led to greater demand for applications that users interact with through voice recognition, since multimedia content has become a valuable source for computational analysis. Vocal representations are extracted for purposes such as convenience, accessibility, security, and sentiment analysis. The main challenges of speech recognition lie in the variability of speakers, environments, and devices, and in the presence of disfluencies in speech. These factors affect transcription tools, which are essential when a user interacts by voice and text must be produced from that interaction. In particular, detecting disfluencies can help identify aspects of the speaker's emotional state. This work presents an analysis of transcription tools, with a focus on disfluency detection, covering the most commonly used evaluation metrics and the databases used in evaluations for Brazilian Portuguese. An experiment was conducted to evaluate the performance of three tools (IBM Watson, Google Speech, and Vosk). Google Speech achieved the best performance, with an average Word Error Rate of 9.69% for fluent sentences and 17.15% for disfluent sentences, followed by IBM Watson with 11.86% and 23.44%, and Vosk with 14.39% and 22.56%, respectively.
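
For readers unfamiliar with the metric reported above, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the tool's output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function name `word_error_rate`, the lowercase whitespace tokenization, and the example sentences are all assumptions made for illustration; the abstract does not say how the texts were normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: texts are lowercased and split on whitespace; the
    normalization used in the paper is not stated in the abstract.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example (not from the paper's data): the hypothesis contains one
# spurious filler word ("é") and drops one reference word ("para").
reference = "eu quero ir para casa"
hypothesis = "eu é quero ir casa"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2%}")
```

Because insertions count as errors, a tool that transcribes filled pauses or repetitions absent from the reference is penalized; this is one way disfluent speech can raise WER. Note also that WER can exceed 100% when the output is much longer than the reference.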

    • Published in

      IHC '22: Proceedings of the 21st Brazilian Symposium on Human Factors in Computing Systems
      October 2022
      482 pages
      ISBN: 9781450395069
      DOI: 10.1145/3554364

      Copyright © 2022 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 331 of 973 submissions, 34%