Query Based Chinese Phrase Extraction for Site Search

  • Conference paper
Web Information Systems – WISE 2004 (WISE 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3306)

Abstract

Word segmentation (WS) is one of the major issues in information processing for character-based languages, because these languages have no explicit word boundaries. Moreover, a phrase, that is, a combination of several consecutive words, is usually the minimum meaningful unit. Although much work has been done on WS, little has been explored in web site search to mine site-specific knowledge from the user query log for both more accurate WS and better retrieval performance. This paper proposes a novel, statistics-based method to extract phrases from the user query log. The extracted phrases, combined with a general, static dictionary, form a dynamic, site-specific dictionary. Using this dictionary, web documents are segmented into phrases and words, which are kept as separate index terms to build a phrase-enhanced index for site search. The experimental results show that our approach greatly improves retrieval performance. It also helps to detect many out-of-vocabulary words, such as site-specific phrases, newly created words, and names of people and locations, which are difficult to process with a general, static dictionary.
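
The pipeline outlined in the abstract (mine phrases from the query log, merge them with a general dictionary, segment documents, and index phrases and words separately) can be illustrated with a minimal sketch. The abstract does not describe the paper's actual statistical criterion, so this sketch substitutes a plain frequency threshold for phrase extraction and forward maximum matching for segmentation. The function names, the `min_count` parameter, and the sample data are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def extract_phrases(query_log, min_count=2, min_len=2):
    """Assumed stand-in for the paper's statistical method: keep any
    query string that occurs at least min_count times in the log."""
    counts = Counter(q.strip() for q in query_log if q.strip())
    return {q for q, c in counts.items() if c >= min_count and len(q) >= min_len}


def segment(text, dictionary):
    """Forward maximum matching: at each position, take the longest
    dictionary entry; fall back to a single character."""
    max_len = max([1] + [len(w) for w in dictionary])
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def index_terms(text, general_dict, phrases):
    """Segment with the site-specific dictionary, then emit each phrase
    and its constituent words as separate index terms (one possible
    reading of 'kept as separate index terms')."""
    site_dict = general_dict | phrases
    terms = []
    for tok in segment(text, site_dict):
        terms.append(tok)
        if tok in phrases:
            terms.extend(w for w in segment(tok, general_dict) if w != tok)
    return terms


# Hypothetical sample data.
general_dict = {"清华", "大学", "图书馆"}
query_log = ["清华大学", "清华大学", "图书馆", "清华大学图书馆"]

phrases = extract_phrases(query_log)
tokens = segment("清华大学图书馆", general_dict | phrases)
terms = index_terms("清华大学图书馆", general_dict, phrases)
print(phrases, tokens, terms)
```

Here the repeated query "清华大学" is promoted to a phrase, so the document is segmented into the phrase plus "图书馆", while the words "清华" and "大学" remain available as index terms. A real system would replace the frequency threshold with a proper statistical test and would likely weight phrase and word terms differently at retrieval time.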

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Xu, J., Ye, S., Li, X. (2004). Query Based Chinese Phrase Extraction for Site Search. In: Zhou, X., Su, S., Papazoglou, M.P., Orlowska, M.E., Jeffery, K. (eds) Web Information Systems – WISE 2004. WISE 2004. Lecture Notes in Computer Science, vol 3306. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30480-7_14

  • DOI: https://doi.org/10.1007/978-3-540-30480-7_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23894-2

  • Online ISBN: 978-3-540-30480-7

  • eBook Packages: Springer Book Archive
