ABSTRACT
Advancements in and easier access to technology have led to greater demand for applications whose interaction is performed through voice recognition, and multimedia content has become a valuable source for computational analysis. Vocal representations are extracted for various purposes in areas such as convenience, accessibility, security, and sentiment analysis. The main challenge of speech recognition lies in the variability of speakers, environments, and devices, and in the presence of disfluencies in spontaneous speech. These aspects affect transcription tools, which are essential when the user interacts through voice and expects text to be produced from that interaction. In particular, the detection of disfluencies can help identify aspects related to the emotional state of the speaker. This work presents an analysis of text transcription tools, with a focus on disfluency detection, covering the metrics most used for evaluation and the databases used in evaluations in the context of Brazilian Portuguese. An experiment was conducted to evaluate the performance of three tools (IBM Watson, Google Speech, and Vosk). Google Speech achieved the best performance, with an average Word Error Rate of 9.69% for fluent sentences and 17.15% for disfluent sentences, followed by IBM Watson with 11.86% and 23.44%, and Vosk with 14.39% and 22.56%, respectively.
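The Word Error Rate reported above is the standard transcription metric: the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch of this computation (the function name and example sentences are illustrative, not taken from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: Levenshtein distance over word tokens,
    normalized by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of five reference words gives WER = 0.20 (20%)
print(wer("o gato subiu no telhado", "o gato subiu telhado"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is one reason disfluent speech (with fillers and repetitions absent from the reference) scores markedly worse in the experiment above.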