Abstract
Unlisted word identification is an active research topic in Chinese information processing. String frequency statistics is a simple and effective method for extracting unlisted words, but existing algorithms cannot meet the speed requirements of large-scale text processing systems. Using the strategies of increasing string length and level-wise scanning, this paper presents a fast algorithm for extracting frequent strings and improves the string frequency statistical method. The approach needs neither a thesaurus nor word segmentation; instead, it uses average mutual information to decide whether each frequent string is a word. Experiments show that, compared with previous approaches, the algorithm is faster and achieves an accuracy of 91% or higher.
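The abstract names two ingredients, level-wise scanning with increasing string length and an average mutual information test, without giving their details. The following is a minimal sketch of how such a pipeline might look, not the paper's actual algorithm. The function names, the `min_count` and `max_len` parameters, the pruning rule (a length-k candidate is counted only if its length-(k-1) prefix and suffix are both frequent), and the particular definition of average mutual information (the mean pointwise mutual information over all binary split points) are all assumptions made for illustration.

```python
import math
from collections import Counter


def frequent_strings(text, min_count=2, max_len=4):
    """Level-wise frequent substring extraction (illustrative sketch).

    Assumption: a length-k string can only be frequent if both its
    length-(k-1) prefix and suffix are frequent, so other candidates
    are skipped without being counted.
    """
    counts = {}
    # Level 1: count single characters.
    level = Counter(text)
    frequent = {s for s, c in level.items() if c >= min_count}
    counts.update({s: level[s] for s in frequent})

    k = 2
    while frequent and k <= max_len:
        level = Counter()
        # One scan of the text per level, with string length increasing.
        for i in range(len(text) - k + 1):
            cand = text[i:i + k]
            if cand[:-1] in frequent and cand[1:] in frequent:
                level[cand] += 1
        frequent = {s for s, c in level.items() if c >= min_count}
        counts.update({s: level[s] for s in frequent})
        k += 1
    return counts


def avg_mutual_information(s, counts, n):
    """Mean pointwise mutual information over all binary splits of s.

    Assumption: probabilities are estimated as count / n, where n is the
    text length; the paper's exact definition is not given in the abstract.
    """
    if len(s) < 2:
        raise ValueError("need a string of at least two characters")
    p = counts[s] / n
    scores = []
    for i in range(1, len(s)):
        left, right = s[:i], s[i:]
        scores.append(math.log2(p / ((counts[left] / n) * (counts[right] / n))))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    text = "中文信息处理和中文信息检索"  # hypothetical toy input
    counts = frequent_strings(text)
    for s in sorted(counts, key=len, reverse=True):
        if len(s) >= 2:
            print(s, counts[s], round(avg_mutual_information(s, counts, len(text)), 3))
```

In this sketch, a frequent string would be accepted as a word candidate when its score exceeds some threshold; the threshold value is left open because the abstract does not state one. A real system would also split the text at punctuation before scanning so that candidates do not span sentence boundaries.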
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, Y., Liu, C. (2007). An Improved Fast Algorithm of Frequent String Extracting with no Thesaurus. In: Gelbukh, A., Kuri Morales, Á.F. (eds) MICAI 2007: Advances in Artificial Intelligence. MICAI 2007. Lecture Notes in Computer Science, vol 4827. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76631-5_85
DOI: https://doi.org/10.1007/978-3-540-76631-5_85
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76630-8
Online ISBN: 978-3-540-76631-5
eBook Packages: Computer Science (R0)