poster

A segment-based hidden Markov model for real-setting Pinyin-to-Chinese conversion

Authors: Xiaohua Zhou, Xiaohua Hu, Xiaodan Zhang, Xiajiong Shen

CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

Pages 1027 - 1030

https://doi.org/10.1145/1321440.1321602

Published: 06 November 2007


Abstract

The hidden Markov model (HMM) is frequently used for Pinyin-to-Chinese conversion, but a first-order HMM captures only the dependency on the preceding character. Higher-order Markov models can bring higher accuracy, but they are computationally unaffordable in an average PC setting. We propose a segment-based hidden Markov model (SHMM), which has the same order of complexity as a first-order HMM but achieves higher decoding accuracy. SHMM distinguishes a word from a bigram that connects two words, and assigns a reasonable probability to each word as a whole. It is therefore more powerful than HMM at decoding words that contain more than two characters. We conduct a comprehensive Pinyin-to-Chinese conversion evaluation on the Lancaster corpus. The experiments show that perfect sentence accuracy improves from 34.7% (HMM) to 43.3% (SHMM), and that one-error sentence accuracy increases from 72.7% to 78.3%. Furthermore, SHMM integrates seamlessly with pinyin typing correction, acronym pinyin input, user-defined words, and self-adaptive learning, all of which are a must for a commercial Pinyin-to-Chinese conversion product in order to improve the efficiency of pinyin input.
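To make the baseline concrete, the following is a minimal sketch of first-order HMM decoding (Viterbi) for pinyin input, in which each character is scored only against the character immediately before it. It illustrates the plain HMM baseline that the abstract describes, not the proposed SHMM, whose details are not given here. The lexicon `CANDIDATES`, the tables `START` and `BIGRAM`, the constant `FLOOR`, and all probability values are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy lexicon: pinyin syllable -> candidate characters.
CANDIDATES = {
    "zhong": ["中", "种"],
    "guo": ["国", "过"],
    "ren": ["人", "认"],
}

# Hypothetical sentence-start and character-bigram probabilities
# (made-up numbers, not taken from the paper).
START = {"中": 0.6, "种": 0.4}
BIGRAM = {
    ("中", "国"): 0.7, ("中", "过"): 0.3,
    ("种", "国"): 0.2, ("种", "过"): 0.8,
    ("国", "人"): 0.9, ("国", "认"): 0.1,
    ("过", "人"): 0.5, ("过", "认"): 0.5,
}
FLOOR = 1e-6  # crude floor probability for events missing from the tables


def viterbi(syllables):
    """Return the most probable character string under a first-order HMM."""
    # best[c] = (log-probability of the best path ending in c, that path)
    best = {
        c: (math.log(START.get(c, FLOOR)), [c])
        for c in CANDIDATES[syllables[0]]
    }
    for syl in syllables[1:]:
        nxt = {}
        for c in CANDIDATES[syl]:
            # Emission P(syllable | character) is assumed to be 1 for every
            # listed candidate, so only the bigram transition is scored.
            score, path = max(
                (prev_score + math.log(BIGRAM.get((p, c), FLOOR)), prev_path)
                for p, (prev_score, prev_path) in best.items()
            )
            nxt[c] = (score, path + [c])
        best = nxt
    # Pick the highest-scoring path over all final characters.
    return "".join(max(best.values())[1])


print(viterbi(["zhong", "guo", "ren"]))
```

Because each step looks back only one character, the cost grows with the square of the number of candidates per syllable; a second-order model would have to track pairs of characters as states, which is the computational burden the abstract refers to.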


Cited By

Xiong S, Ma L, Cheng M, Wang B (2021). Pinyin-to-Chinese conversion on sentence-level for domain-specific applications using self-attention model. Multimedia Systems 28:2 (375-386). DOI: 10.1007/s00530-021-00829-y. Online publication date: 17-Jul-2021.
https://doi.org/10.1007/s00530-021-00829-y
Liu W, Hachey B, Osborne M (2010). Detecting word misuse in Chinese. Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media (5-6). DOI: 10.5555/1860667.1860670. Online publication date: 6-Jun-2010.
https://dl.acm.org/doi/10.5555/1860667.1860670
Liu W, Guthrie L (2009). Chinese Pinyin-Text Conversion on Segmented Text. Proceedings of the 12th International Conference on Text, Speech and Dialogue (116-123). DOI: 10.1007/978-3-642-04208-9_19. Online publication date: 25-Aug-2009.
https://dl.acm.org/doi/10.1007/978-3-642-04208-9_19

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing


Information & Contributors

Information

Published In

CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

November 2007

1048 pages

ISBN:9781595938039

DOI:10.1145/1321440

Co-chair: Alberto H. F. Laender
Conference Chairs: André O. Falcão (Universidade de Lisboa, Portugal), Øystein Haug Olsen
General Chair: Mário J. Silva (Universidade de Lisboa, Portugal)
Program Chairs: Ricardo Baeza-Yates, Deborah L. McGuinness, Bjorn Olstad

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Poster

Conference

CIKM07: Conference on Information and Knowledge Management

November 6 - 10, 2007

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


Bibliometrics & Citations

Article Metrics

Total Citations: 3
Total Downloads: 336
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 1

Reflects downloads up to 01 Mar 2025
