DOI: 10.1145/1321440.1321602
poster

A segment-based hidden Markov model for real-setting Pinyin-to-Chinese conversion

Published: 06 November 2007

Abstract

Hidden Markov models (HMMs) are frequently used for Pinyin-to-Chinese conversion, but an HMM captures only the dependency on the preceding character. Higher-order Markov models can achieve higher accuracy, but they are computationally unaffordable on an average PC. We propose a segment-based hidden Markov model (SHMM), which has the same order of complexity as a first-order HMM but achieves higher decoding accuracy. SHMM distinguishes a word from a bigram that connects two words, and assigns a reasonable probability to each word as a whole. It is more powerful than HMM at decoding words that contain more than two characters. We conduct a comprehensive Pinyin-to-Chinese conversion evaluation on the Lancaster corpus. The experiment shows that perfect-sentence accuracy improves from 34.7% (HMM) to 43.3% (SHMM), and one-error sentence accuracy increases from 72.7% to 78.3%. Furthermore, SHMM integrates seamlessly with pinyin typing correction, acronym pinyin input, user-defined words, and self-adaptive learning, all of which a commercial Pinyin-to-Chinese conversion product needs in order to improve the efficiency of pinyin input.
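To make the baseline concrete, the following is a minimal sketch of first-order HMM decoding (Viterbi search) for pinyin-to-character conversion. It illustrates only the baseline that the abstract compares against, not the authors' SHMM, whose details are not given here. The function name `viterbi`, the tables `CANDIDATES`, `START`, and `TRANS`, and every probability value are hypothetical and invented for illustration. The sketch also assumes that each candidate character emits its pinyin syllable with probability 1, so only the start and transition probabilities affect the score.

```python
import math

# Hypothetical toy model: all entries and numbers are invented for illustration.
CANDIDATES = {"zhong": ["中", "种"], "guo": ["国", "过"]}  # characters per syllable
START = {"中": 0.6, "种": 0.4}                              # P(first character)
TRANS = {                                                   # P(current | previous)
    ("中", "国"): 0.7, ("中", "过"): 0.3,
    ("种", "国"): 0.1, ("种", "过"): 0.9,
}


def viterbi(syllables, candidates, start, trans, floor=1e-6):
    """Return the most probable character string under a first-order HMM.

    Unseen start or transition events get the small probability `floor`.
    """
    # best[c] = (log-probability, path) of the best sequence ending in c
    best = {
        c: (math.log(start.get(c, floor)), [c])
        for c in candidates[syllables[0]]
    }
    for syl in syllables[1:]:
        step = {}
        for cur in candidates[syl]:
            # Only the immediately preceding character influences the score.
            score, path = max(
                (prev_score + math.log(trans.get((prev, cur), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[cur] = (score, path + [cur])
        best = step
    return "".join(max(best.values())[1])


print(viterbi(["zhong", "guo"], CANDIDATES, START, TRANS))  # 中国
```

In this toy model, 中国 scores 0.6 × 0.7 = 0.42 and beats 种过 at 0.4 × 0.9 = 0.36. Because each step looks back only one character, the cost grows with the number of syllables times the square of the candidates per syllable, which is the complexity level the abstract says SHMM preserves.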


Cited By

  • (2021) Pinyin-to-Chinese conversion on sentence-level for domain-specific applications using self-attention model. Multimedia Systems 28:2 (375-386). DOI: 10.1007/s00530-021-00829-y. Online publication date: 17-Jul-2021
  • (2010) Detecting word misuse in Chinese. Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media (5-6). DOI: 10.5555/1860667.1860670. Online publication date: 6-Jun-2010
  • (2009) Chinese Pinyin-Text Conversion on Segmented Text. Proceedings of the 12th International Conference on Text, Speech and Dialogue (116-123). DOI: 10.1007/978-3-642-04208-9_19. Online publication date: 25-Aug-2009

    Published In

    CIKM '07: Proceedings of the sixteenth ACM conference on information and knowledge management
    November 2007
    1048 pages
    ISBN:9781595938039
    DOI:10.1145/1321440

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. chinese input
    2. pinyin
    3. segment-based hidden markov model

    Qualifiers

    • Poster

    Conference

    CIKM07

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
