Query Based Chinese Phrase Extraction for Site Search

Xu, Jingfang; Ye, Shaozhi; Li, Xing

doi:10.1007/978-3-540-30480-7_14


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3306)

Included in the following conference series:

International Conference on Web Information Systems Engineering


Abstract

Word segmentation (WS) is one of the major issues in information processing for character-based languages, because these languages have no explicit word boundaries. Moreover, a phrase, that is, a combination of several consecutive words, is usually the minimum meaningful unit. Although much work has been done on WS, little has been explored in site search on mining site-specific knowledge from the user query log to achieve both more accurate WS and better retrieval performance. This paper proposes a novel, statistics-based method that extracts phrases from the user query log. The extracted phrases, combined with a general, static dictionary, form a dynamic, site-specific dictionary. Using this dictionary, web documents are segmented into phrases and words, which are kept as separate index terms to build a phrase-enhanced index for site search. Experimental results show that our approach greatly improves retrieval performance. It also helps to detect many out-of-vocabulary words, such as site-specific phrases, newly created words, and names of people and locations, which are difficult to process with a general, static dictionary.
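The pipeline outlined in the abstract can be illustrated with a minimal sketch. The abstract does not state which statistic the paper uses, so this example substitutes a plain frequency threshold over the query log, and it uses forward maximum matching as a stand-in segmenter. The function names (`extract_phrases`, `segment`, `index_terms`), the `min_count` and `max_len` parameters, and the toy dictionary and queries are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def segment(text, dictionary, max_len=6):
    """Forward maximum matching: at each position take the longest
    dictionary entry, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def extract_phrases(queries, general_dict, min_count=2, max_len=6):
    """Treat a query as a phrase candidate when it occurs at least
    min_count times and the general dictionary splits it into several
    tokens. ASSUMPTION: a frequency threshold stands in for the paper's
    statistic, which the abstract does not specify."""
    counts = Counter(q.strip() for q in queries if q.strip())
    return {
        q for q, c in counts.items()
        if c >= min_count and len(segment(q, general_dict, max_len)) > 1
    }


def index_terms(text, general_dict, phrases, max_len=6):
    """Segment with the site-specific dictionary (general words plus
    extracted phrases) and keep phrases and their constituent words as
    separate index terms."""
    site_dict = general_dict | phrases
    terms = []
    for token in segment(text, site_dict, max_len):
        terms.append(token)
        if token in phrases:
            # Also index the words inside the phrase.
            terms.extend(segment(token, general_dict, max_len))
    return terms


# Toy data, invented for illustration only.
general_dict = {"清华", "大学", "图书馆"}
query_log = ["清华大学", "清华大学", "图书馆", "清华大学图书馆"]

phrases = extract_phrases(query_log, general_dict)
print(sorted(phrases))  # ['清华大学']
print(index_terms("清华大学图书馆", general_dict, phrases))
# ['清华大学', '清华', '大学', '图书馆']
```

Keeping both the phrase and its constituent words in the term list is one way to read "kept as separate index terms": a query for the full phrase can match the phrase term directly, while a query for a single word still matches the document.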



Author information

Authors and Affiliations

Department of Electronic Engineering, Tsinghua University, Beijing, 100084, P.R. China
Jingfang Xu, Shaozhi Ye & Xing Li


Editor information

Editors and Affiliations

School of ITEE, The University of Queensland, Australia
Xiaofang Zhou
Database Systems Research and Development Center, University of Florida, P.O. Box 116125, 470 CSE, 32601-6125, Gainesville, FL, USA
Stanley Su
INFOLAB, Dept. of Information Systems and Management, Tilburg University, The Netherlands
Mike P. Papazoglou
Polish-Japanese Institute of Information Technology, Faculty of IT, Ul. Koszykowa 86, 02-008, Warsaw, Poland
Maria Elzbieta Orlowska
Rutherford Appleton Laboratory, Science and Technology Facilities Council, Harwell Science and Innovation Campus, OX11 0QX, Didcot, UK
Keith Jeffery


About this paper

Cite this paper

Xu, J., Ye, S., Li, X. (2004). Query Based Chinese Phrase Extraction for Site Search. In: Zhou, X., Su, S., Papazoglou, M.P., Orlowska, M.E., Jeffery, K. (eds) Web Information Systems – WISE 2004. WISE 2004. Lecture Notes in Computer Science, vol 3306. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30480-7_14


DOI: https://doi.org/10.1007/978-3-540-30480-7_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23894-2
Online ISBN: 978-3-540-30480-7
eBook Packages: Springer Book Archive
