An Improved Fast Algorithm of Frequent String Extracting with no Thesaurus

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4827)

Abstract

Unlisted word identification is a central research topic in Chinese information processing. String frequency statistics is a simple and effective method for extracting unlisted words, but existing algorithms are too slow for systems that process very large volumes of text. Using the strategies of increasing string length and level-wise scanning, this paper presents a fast algorithm for extracting frequent strings and improves the string frequency statistical method. The approach requires neither a thesaurus nor word segmentation; instead, it uses average mutual information to decide whether each frequent string is a word. Experiments show that, compared with previous approaches, the algorithm is faster and achieves an accuracy of 91% or higher.
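
The abstract gives no pseudocode, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes an Apriori-style level-wise scan (a string of length k is counted only if its length k-1 prefix and suffix were frequent at the previous level) and a hypothetical score that averages pointwise mutual information over all binary splits of a string. The names `frequent_strings`, `avg_mutual_information`, `min_count`, and `max_len` are illustrative and do not come from the paper.

```python
import math
from collections import Counter


def frequent_strings(text, min_count=2, max_len=4):
    """Return {length: {string: count}} for strings seen at least min_count times.

    Assumed level-wise strategy: candidates of length k are counted only when
    both their (k-1)-prefix and (k-1)-suffix were frequent at level k-1.
    """
    levels = {}
    # Level 1: frequent single characters.
    frequent = {s: c for s, c in Counter(text).items() if c >= min_count}
    k = 1
    while frequent:
        levels[k] = frequent
        if k == max_len:
            break
        k += 1
        counts = Counter()
        for i in range(len(text) - k + 1):
            s = text[i:i + k]
            # Prune: skip candidates containing an infrequent substring.
            if s[:-1] in frequent and s[1:] in frequent:
                counts[s] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
    return levels


def avg_mutual_information(s, levels, n_chars):
    """Hypothetical word-likeness score for a frequent string of length >= 2.

    Averages log2(p(s) / (p(left) * p(right))) over every split of s.
    The paper's exact definition may differ.
    """
    def p(x):
        # Relative frequency among all positions where a string of this length fits.
        return levels[len(x)][x] / (n_chars - len(x) + 1)

    scores = [math.log2(p(s) / (p(s[:i]) * p(s[i:]))) for i in range(1, len(s))]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    sample = "abcabcabcxyz"
    found = frequent_strings(sample, min_count=2, max_len=3)
    for length, strings in found.items():
        print(length, strings)
    print(avg_mutual_information("abc", found, len(sample)))
```

Under these assumptions, a frequent string would be accepted as a word when its score exceeds a chosen threshold; the threshold, like the score itself, would have to be taken from the full paper.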

Editor information

Alexander Gelbukh, Ángel Fernando Kuri Morales

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, Y., Liu, C. (2007). An Improved Fast Algorithm of Frequent String Extracting with no Thesaurus. In: Gelbukh, A., Kuri Morales, Á.F. (eds) MICAI 2007: Advances in Artificial Intelligence. MICAI 2007. Lecture Notes in Computer Science, vol 4827. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76631-5_85

  • DOI: https://doi.org/10.1007/978-3-540-76631-5_85

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-76630-8

  • Online ISBN: 978-3-540-76631-5

  • eBook Packages: Computer Science, Computer Science (R0)
