Abstract
This paper proposes a Chinese word segmentation algorithm based on forward maximum matching and word binding force. To support the algorithm, a text corpus of over 63 million characters is used to enrich an 80,000-word lexicon in terms of its word entries and word binding forces. In its current form, given an input line of text, the word segmentor processes on average 210,000 characters per second on an IBM RISC System/6000 3BT workstation, with a correct word identification rate of 99.74%. The proposed algorithm can be applied to the large volume of Chinese text on the Internet.
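As a rough illustration of the forward maximum matching component only, the following Python sketch scans the text left to right and, at each position, takes the longest substring found in a lexicon. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `forward_maximum_match`, the `max_word_len` cap, the single-character fallback, and the toy lexicon are all hypothetical, and the word binding force step is not modelled because the abstract does not define it.

```python
def forward_maximum_match(text, lexicon, max_word_len=4):
    """Greedy left-to-right segmentation (hypothetical helper, not from the paper).

    At each position, the longest candidate of up to max_word_len characters
    that appears in the lexicon is emitted as a word. If no candidate matches,
    the single character at that position is emitted on its own.
    """
    words = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        # Try the longest candidate first and shrink until the lexicon matches.
        for length in range(longest, 0, -1):
            candidate = text[i:i + length]
            # Assumed fallback: an unmatched single character becomes a word.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Toy lexicon for illustration only; the paper's lexicon has about 80,000 words.
toy_lexicon = {"中文", "信息", "处理", "信息处理"}
print(forward_maximum_match("中文信息处理", toy_lexicon))
```

With this toy lexicon, the longer entry "信息处理" is preferred over the shorter "信息" followed by "处理", which is the defining behaviour of maximum matching.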
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wong, P.K. (1999). An Efficient Chinese Word Segmentation Algorithm for Chinese Information Processing on the Internet. In: Hui, L.C.K., Lee, DL. (eds) Internet Applications. ICSC 1999. Lecture Notes in Computer Science, vol 1749. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-46652-9_47
DOI: https://doi.org/10.1007/978-3-540-46652-9_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66903-6
Online ISBN: 978-3-540-46652-9
eBook Packages: Springer Book Archive