A Dialectal Chinese Speech Recognition Framework

Journal of Computer Science and Technology

Abstract

This paper proposes and studies a framework for dialectal Chinese speech recognition, in which a relatively small corpus of dialectal Chinese speech (that is, Chinese influenced by the speaker's native dialect) and dialect-related knowledge are used to transform a standard Chinese (Putonghua, abbreviated PTH) speech recognizer into a dialectal Chinese speech recognizer. Two kinds of knowledge sources are explored: expert knowledge and a small dialectal Chinese corpus. These sources provide information at four levels: the phonetic level, the lexicon level, the language level, and the acoustic decoder level. Wu dialectal Chinese (WDC) is taken as the example target language. The goal is to build a WDC speech recognizer from an existing PTH speech recognizer, based on the Initial-Final structure of the Chinese language and a study of how dialectal Chinese speakers speak Putonghua. The authors propose using context-independent PTH-IF mappings (where IF denotes either a Chinese Initial or a Chinese Final), context-independent WDC-IF mappings, and syllable-dependent WDC-IF mappings (obtained from either experts or data), combined with supervised maximum likelihood linear regression (MLLR) acoustic model adaptation. The IF mappings introduce a multi-pronunciation lexicon whose size may increase lexical confusion and thereby degrade performance; to reduce its size, a Multi-Pronunciation Expansion (MPE) method based on the accumulated uni-gram probability (AUP) is proposed. In addition, some commonly used WDC words are selected and added to the lexicon. Compared with the original PTH speech recognizer, the resulting WDC speech recognizer achieves a 10–18% absolute reduction in Character Error Rate (CER) when recognizing WDC, with only a 0.62% CER increase when recognizing PTH.
The proposed framework and methods are expected to work not only for Wu dialectal Chinese but also for other dialectal Chinese languages and even other languages.
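
The abstract reports its results as absolute changes in Character Error Rate. As a minimal sketch of what that metric measures, the Python below computes CER in its usual form: the character-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a recognizer hypothesis, divided by the reference length. The function names, the example sentences, and the error rates in the usage lines are illustrative assumptions, not code or figures from the paper, and the authors' scoring tool may differ in details such as text normalization.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted character out of five.
cer_example = character_error_rate("我们去上海", "我们是上海")

# Hypothetical error rates, only to contrast absolute and relative reduction.
baseline_cer, adapted_cer = 0.40, 0.25
absolute_reduction = baseline_cer - adapted_cer               # 15 points
relative_reduction = absolute_reduction / baseline_cer        # 37.5 percent
```

An absolute reduction is a difference in percentage points, so the same absolute figure corresponds to a larger relative improvement when the baseline error rate is lower.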

Author information

Corresponding author

Correspondence to Jing Li.

Additional information

This paper is based upon a study supported by the US National Science Foundation under Grant No.0121285. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Jing Li is currently a Ph.D. candidate at the Center for Speech Technology, State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University. He received his B.S. degree in computer science and technology from Tsinghua University in 1999. His current research focuses on dialectal Chinese speech recognition, acoustic modeling, and keyword spotting.

Thomas Fang Zheng graduated from the Department of Computer Science & Technology of Tsinghua University and received his B.S., M.S. and Ph.D. degrees from Tsinghua University in 1990, 1992 and 1997, respectively. Dr. Zheng is currently a professor at Tsinghua University. He is Vice Dean of the Research Institute of Information Technology of Tsinghua University and Director of the Center for Speech Technology, State Key Laboratory of Intelligent Technology and Systems. Dr. Zheng is now the Council Chair of the Chinese Corpus Consortium, an IEEE member, an ISCA member, a senior member of the China Computer Federation, a member of the Artificial Intelligence and Pattern Recognition Technical Commission of the China Computer Federation, a member of the editorial committee of the Journal of Chinese Information Processing, and a key member of Oriental-COCOSDA. He was a senior member and a co-leader at the Johns Hopkins University's Summer Workshop of Language and Speech Processing in 2000 and 2004, working on pronunciation modeling and dialectal Chinese recognition, respectively. His main research interests are speech recognition, natural language understanding, and speaker recognition.

William Byrne received the B.S. degree in electrical engineering from Cornell University, Ithaca, NY, in 1982, and the Ph.D. degree in electrical engineering from the University of Maryland, College Park, MD, in 1993. He has worked at Entropic Research Laboratory, Washington DC, and the National Institutes of Health, Bethesda, MD. He is currently a research associate professor in the Department of Electrical Engineering and the Center for Language and Speech Processing at the Johns Hopkins University, Baltimore, MD, and a university lecturer in the Machine Intelligence Laboratory and a member of the Speech Research Group, Cambridge University, UK. His main research interests are in statistical modeling techniques for speech and language processing, with a recent interest in statistical machine translation.

Dan Jurafsky is an associate professor in the Department of Linguistics, Stanford University, which he joined in January 2004. He received his B.A. degree in linguistics in 1983 and his Ph.D. degree in computer science in 1992, both from UC Berkeley. He then worked for 8 years at the University of Colorado at Boulder, where he was an assistant and associate professor in the Department of Linguistics, the Institute of Cognitive Science, the Department of Computer Science, and the Center for Spoken Language Research. He still maintains an adjunct position at the University of Colorado and continues to work closely with colleagues there. His research focuses on statistical models of human and machine language processing, especially computational linguistics, automatic speech recognition and understanding, computational psycholinguistics, and natural language processing. He received the National Science Foundation CAREER award in 1998 and the MacArthur Fellowship in 2002. His most recent book, with James H. Martin, is the widely used textbook “Speech and Language Processing”.

About this article

Cite this article

Li, J., Zheng, T.F., Byrne, W. et al. A Dialectal Chinese Speech Recognition Framework. J Comput Sci Technol 21, 106–115 (2006). https://doi.org/10.1007/s11390-006-0106-9

  • DOI: https://doi.org/10.1007/s11390-006-0106-9
