research-article

Chinese Audio Transcription Using Connectionist Temporal Classification

Authors:

Dyah E. Herwindiati,

Janson Hendryli

ICCCM '20: Proceedings of the 8th International Conference on Computer and Communications Management

Pages 92 - 96

https://doi.org/10.1145/3411174.3411180

Published: 26 August 2020

Abstract

Mandarin is one of the world's most widely spoken languages. Several factors determine whether a learner becomes proficient in it; in particular, communicating properly requires mastery of both Chinese characters (hànzì) and pīnyīn. We develop an Android-based app that helps students learning Mandarin practice the accuracy of their pronunciation and intonation against sentences displayed on the screen, which are taken from the HSK textbook. The system recognizes the user's voice and transcribes it to hànzì. Features are extracted from the recorded voice using the filter bank method, and the resulting feature vectors are fed into a deep learning architecture to obtain the pīnyīn. The architecture combines a convolutional neural network, recurrent neural networks, and connectionist temporal classification. After the pīnyīn letters are generated, the Markov chain rule is used to convert them into hànzì. The best word error rate of the model is 18.919% on the training data and 19.922% on the test data. In user testing, the error rate is 49.659 ± 16.372%, which we attribute to background noise and the users' pronunciation speed.
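The abstract names connectionist temporal classification (CTC) and word error rate without showing how either operates. The following is a minimal sketch, not the authors' implementation. It assumes greedy (best-path) CTC decoding, a blank symbol at index 0, a toy three-syllable pīnyīn vocabulary, and a word error rate computed over pīnyīn syllable tokens; none of these details are stated in the abstract, and all names (`VOCAB`, `ctc_greedy_decode`, `word_error_rate`) are hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol
VOCAB = ["<blank>", "ni3", "hao3", "ma5"]  # hypothetical toy vocabulary


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Best-path decoding: argmax per frame, merge repeats, drop blanks."""
    decoded: List[int] = []
    previous = blank
    for scores in frame_scores:
        label = max(range(len(scores)), key=scores.__getitem__)
        # Emit a label only when it differs from the previous frame's label
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def word_error_rate(reference: Sequence[str],
                    hypothesis: Sequence[str]) -> float:
    """Levenshtein distance between token sequences / reference length."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    prev_row = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        cur_row = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            substitution = prev_row[j - 1] + (0 if ref_tok == hyp_tok else 1)
            deletion = prev_row[j] + 1
            insertion = cur_row[j - 1] + 1
            cur_row.append(min(substitution, deletion, insertion))
        prev_row = cur_row
    return prev_row[-1] / len(reference)


# Made-up per-frame scores over VOCAB (one row per audio frame).
frames = [
    [0.10, 0.80, 0.05, 0.05],  # ni3
    [0.10, 0.70, 0.10, 0.10],  # ni3 (repeat, merged)
    [0.90, 0.03, 0.03, 0.04],  # blank
    [0.10, 0.10, 0.70, 0.10],  # hao3
    [0.60, 0.10, 0.20, 0.10],  # blank
    [0.10, 0.10, 0.10, 0.70],  # ma5
]

syllables = [VOCAB[i] for i in ctc_greedy_decode(frames)]
wer = word_error_rate(["ni3", "hao3", "ma5"], ["ni3", "hao4", "ma5"])
```

Here `syllables` recovers the three-syllable sequence from six frames, and `wer` is one substitution over three reference tokens. A beam search with a language model is a common alternative to greedy decoding; the abstract does not say which the system uses.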

Index Terms

1. Applied computing
  1. Education

Published In

ICCCM '20: Proceedings of the 8th International Conference on Computer and Communications Management

July 2020

152 pages

ISBN:9781450387668

DOI:10.1145/3411174

Copyright © 2020 ACM.


In-Cooperation

National University of Singapore
Simon Fraser University
Western Michigan University
University of Sydney, Australia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICCCM'20

ICCCM'20: 2020 The 8th International Conference on Computer and Communications Management

July 17 - 19, 2020

Singapore, Singapore
