DOI: 10.1145/3411174.3411180
research-article

Chinese Audio Transcription Using Connectionist Temporal Classification

Published: 26 August 2020

Abstract

Mandarin is one of the world's most widely spoken languages. Several factors matter for learners who want to become proficient in it. To communicate properly, learners must master both Chinese characters (hànzì) and pīnyīn. We develop an Android app to help students who are learning Mandarin. It lets them practice the accuracy of their pronunciation and intonation against sentences displayed on the screen, which are taken from the HSK textbook. The system recognizes the learner's voice and transcribes it to hànzì. Features are first extracted from the recorded voice using the filter bank method. The feature vectors are then fed into a deep learning architecture, made up of a convolutional neural network, recurrent neural networks, and connectionist temporal classification, to obtain the pīnyīn. After the pīnyīn letters are generated, the Markov chain rule is used to convert them into hànzì. The best word error rate of the model is 18.919% on the training data and 19.922% on the test data. In user testing, the error rate is 49.659 ± 16.372%, which we attribute to background noise and the users' pronunciation speed.
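Two steps named in the abstract can be illustrated with a short sketch: collapsing a per-frame connectionist temporal classification (CTC) output into a pīnyīn sequence, and scoring a transcription by word error rate. This is a minimal illustration and not the authors' implementation. The abstract does not say how the CTC output is decoded, so greedy (best-path) collapsing is an assumption here, and the function names, the blank symbol "_", and the toned pīnyīn tokens are all hypothetical.

```python
from typing import List, Optional, Sequence

BLANK = "_"  # hypothetical blank symbol; real CTC models reserve a label index


def ctc_greedy_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> List[str]:
    """Collapse a best-path CTC labelling: merge adjacent repeats, then drop blanks."""
    collapsed: List[str] = []
    previous: Optional[str] = None
    for label in frame_labels:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference tokens."""
    if not reference:
        raise ValueError("reference must not be empty")
    # row[j] holds the edit distance between the reference prefix processed
    # so far and hypothesis[:j]; start with the empty reference prefix.
    row = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        new_row = [i] + [0] * len(hypothesis)
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = row[j - 1] + (0 if ref_token == hyp_token else 1)
            deletion = row[j] + 1          # reference token missing from hypothesis
            insertion = new_row[j - 1] + 1  # extra token in hypothesis
            new_row[j] = min(substitution, deletion, insertion)
        row = new_row
    return row[-1] / len(reference)


if __name__ == "__main__":
    frames = ["ni3", "ni3", "_", "hao3", "hao3", "_"]
    print(ctc_greedy_collapse(frames))
    print(word_error_rate(["wo3", "xue2", "zhong1", "wen2"],
                          ["wo3", "xue2", "zhong4", "wen2"]))
```

Under these assumptions, one wrong tone in a four-syllable sentence counts as one substitution out of four reference tokens. A blank between two identical labels keeps both, which is how CTC represents a genuinely repeated syllable.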

    Published In

    ICCCM '20: Proceedings of the 8th International Conference on Computer and Communications Management
    July 2020
    152 pages
    ISBN:9781450387668
    DOI:10.1145/3411174
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    In-Cooperation

    • National University of Singapore
    • Simon Fraser University (SFU)
    • Western Michigan University
    • University of Sydney, Australia

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Audio transcription
    2. connectionist temporal classification
    3. Mandarin
    4. word error rate

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICCCM'20
