An Improved Fast Algorithm of Frequent String Extracting with no Thesaurus

Zhang, Yumeng; Liu, Chuanhan

doi:10.1007/978-3-540-76631-5_85

Yumeng Zhang¹,² & Chuanhan Liu¹

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4827)

Abstract

Unlisted-word identification is an active research topic in Chinese information processing. String frequency statistics is a simple and effective method for extracting unlisted words, but existing algorithms cannot meet the speed requirements of large-scale text processing systems. Using the strategies of increasing string length and level-wise scanning, this paper presents a fast algorithm for extracting frequent strings and improves the string frequency statistical method. The approach requires neither a thesaurus nor word segmentation; instead, it uses average mutual information to decide whether each frequent string is a word. Experiments show that, compared with previous approaches, the algorithm is faster and achieves an accuracy of 91% or higher.
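
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea it names: grow candidate strings one character at a time, scan the text once per length level, and then score each frequent string by mutual information. The function names `frequent_strings` and `average_mutual_information`, the parameters `min_count` and `max_len`, and the scoring formula (pointwise mutual information averaged over all two-way splits of a string) are assumptions made for illustration, not the authors' definitions.

```python
import math
from collections import Counter


def frequent_strings(text, min_count=2, max_len=4):
    """Return {length: {string: count}} for strings seen at least min_count times.

    Level-wise scan (illustrative, Apriori-style): a string of length k+1 is
    counted only if its length-k prefix and suffix were both frequent.
    """
    levels = {}
    # Level 1: every single character that is frequent enough.
    frequent = {s: c for s, c in Counter(text).items() if c >= min_count}
    k = 1
    while frequent:
        levels[k] = frequent
        if k == max_len:
            break
        counts = Counter()
        # One pass over the text for candidates of length k + 1.
        for i in range(len(text) - k):
            cand = text[i:i + k + 1]
            if cand[:-1] in frequent and cand[1:] in frequent:
                counts[cand] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        k += 1
    return levels


def average_mutual_information(s, levels, total):
    """Average pointwise mutual information over all two-way splits of s.

    Assumed formula: mean of log2(p(s) / (p(left) * p(right))), where
    p(x) = count(x) / total. For simplicity, total is the text length at
    every level. Requires len(s) >= 2 and s to be present in levels.
    """
    def p(x):
        return levels[len(x)][x] / total

    scores = [
        math.log2(p(s) / (p(s[:i]) * p(s[i:])))
        for i in range(1, len(s))
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    sample = "abcabcabcxyz"
    levels = frequent_strings(sample, min_count=2, max_len=3)
    for length, table in levels.items():
        print(length, table)
    print(average_mutual_information("ab", levels, len(sample)))
```

Because every substring of a frequent string is itself frequent, the lookups inside `average_mutual_information` always succeed for strings returned by `frequent_strings`. Python strings are sequences of Unicode code points, so the same sketch runs character by character on Chinese text; a threshold on the score would then decide which frequent strings are kept as word candidates.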

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200030, China
Yumeng Zhang & Chuanhan Liu
School of Business, Ningbo University, Ningbo, 315211, China
Yumeng Zhang

Editor information

Alexander Gelbukh, Ángel Fernando Kuri Morales

About this paper

Cite this paper

Zhang, Y., Liu, C. (2007). An Improved Fast Algorithm of Frequent String Extracting with no Thesaurus. In: Gelbukh, A., Kuri Morales, Á.F. (eds) MICAI 2007: Advances in Artificial Intelligence. MICAI 2007. Lecture Notes in Computer Science, vol 4827. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76631-5_85

DOI: https://doi.org/10.1007/978-3-540-76631-5_85
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76630-8
Online ISBN: 978-3-540-76631-5
eBook Packages: Computer Science, Computer Science (R0)
