ABSTRACT
The publication of OpenAI's Whisper model inspired the idea of a web platform that provides voice-to-text services for Vietnamese speakers. Leveraging Whisper's strong generalization capability, we developed a web application with three main features: record-to-text, file-to-text, and a subtitles generator for YouTube. We first fine-tuned Whisper on a Vietnamese speech dataset and then deployed the model as a REST API using the Python Flask framework, with three paths serving the three tasks. The frontend was built with ReactJS, a popular JavaScript library for building user interfaces; its architecture follows component-based design principles, structuring the application into reusable, modular components that improve code maintainability and scalability. The record-to-text feature lets users record audio directly on the web page, after which the audio is processed and converted to text. The file-to-text feature accepts audio files uploaded by users and returns the transcript of each file. Finally, the subtitles generator for YouTube takes a YouTube link as input; after processing, the website displays the video with subtitles aligned to the timestamp of each transcript segment. We hope this project inspires and encourages the testing and application of new automatic speech recognition (ASR) models in concrete, user-facing applications.
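To make the described deployment concrete, the sketch below shows how such a three-path Flask REST API could look, using the openai-whisper package. The route names, the "small" checkpoint (standing in for the fine-tuned Vietnamese model), and the JSON response fields are illustrative assumptions of ours, not the paper's actual API; the YouTube path also assumes the video's audio track has already been downloaded (e.g., with a tool such as yt-dlp).

```python
# Minimal sketch of the three-path Flask REST API described in the abstract.
# Route names, checkpoint, and JSON fields are illustrative assumptions.
import tempfile

import whisper  # openai-whisper
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for the fine-tuned Vietnamese checkpoint; whisper.load_model
# accepts a stock size name ("tiny" ... "large") or a local checkpoint path.
model = whisper.load_model("small")


def transcribe_upload(req):
    """Persist the uploaded audio to a temporary file and run Whisper on it."""
    upload = req.files["audio"]
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        upload.save(tmp.name)
        # language="vi" pins decoding to Vietnamese instead of auto-detection.
        return model.transcribe(tmp.name, language="vi")


@app.route("/record-to-text", methods=["POST"])
def record_to_text():
    # Audio recorded in the browser arrives as a multipart form upload.
    return jsonify({"text": transcribe_upload(request)["text"]})


@app.route("/file-to-text", methods=["POST"])
def file_to_text():
    # A user-uploaded audio file; the full transcript is returned as text.
    return jsonify({"text": transcribe_upload(request)["text"]})


@app.route("/youtube-subtitles", methods=["POST"])
def youtube_subtitles():
    # Assumes the YouTube audio track was downloaded beforehand and is posted
    # here; segment timestamps let the frontend align subtitles to the video.
    result = transcribe_upload(request)
    segments = [
        {"start": s["start"], "end": s["end"], "text": s["text"]}
        for s in result["segments"]
    ]
    return jsonify({"segments": segments})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A client could exercise the file-to-text path with, for example, `curl -F audio=@sample.wav http://localhost:5000/file-to-text`.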
REFERENCES
- Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33 (2020), 12449–12460.
- James Baker. 1975. The DRAGON system–An overview. IEEE Transactions on Acoustics, Speech, and Signal Processing 23, 1 (1975), 24–29.
- William C. Dersch. [n. d.]. IBM Archives: IBM Shoebox. http://www-03.ibm.com/ibm/history/exhibits/specialprod1/specialprod1_7
- Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, et al. 2020. Conformer: Convolution-augmented Transformer for speech recognition. arXiv preprint arXiv:2005.08100 (2020).
- DV Hai. 2021. Vietnamese Automatic Speech Recognition (ASR Challenge).
- Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014).
- Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (2012), 82–97.
- Simon A. Kingaby. 2022. Voice User Interfaces. In Data-Driven Alexa Skills: Voice Access to Rich Data Sources for Enterprise Applications (2022), 3–14.
- Larry R. Medsker and L. C. Jain (Eds.). 2001. Recurrent Neural Networks: Design and Applications. CRC Press.
- Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning. PMLR, 28492–28518.
- Wojciech Samek, Grégoire Montavon, Sebastian Lapuschkin, Christopher J. Anders, and Klaus-Robert Müller. 2021. Explaining deep neural networks and beyond: A review of methods and applications. Proc. IEEE 109, 3 (2021), 247–278.
- Dang Dinh Son, Dang Xuan Vuong, Duong Quang Tien, Ta Bao Thang, et al. 2022. ASR-VLSP 2021: Conformer with Gradient Mask and Stochastic Weight Averaging for Vietnamese Automatic Speech Recognition. VNU Journal of Science: Computer Science and Communication Engineering 38, 1 (2022).
- Pham Viet Thanh, Dao Dang Huy, Luu Duc Thanh, Nguyen Duc Tan, Dang Trung Duc Anh, Nguyen Thi Thu Trang, et al. 2022. ASR-VLSP 2021: Semi-supervised Ensemble Model for Vietnamese Automatic Speech Recognition. VNU Journal of Science: Computer Science and Communication Engineering 38, 1 (2022).
- Nguyen Thi Thu Trang and Nguyen Xuan Tung. 2019. Text-to-speech shared task in VLSP campaign 2019: Evaluating Vietnamese speech synthesis on common datasets. In Vietnamese Language and Speech Processing (VLSP) (2019).