ABSTRACT
Processing Chinese documents and queries for information retrieval (IR) requires identifying the units to be used as index terms. Words and n-grams have both been used as index terms in several previous studies, which showed that the two kinds of indexes lead to comparable IR performance. In this study, we carry out further experiments on different ways to segment documents and queries, and to combine words with n-grams. Our experiments show that combining the longest-matching segmentation algorithm with single characters is the best choice.
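As a rough illustration of the indexing units discussed above, the following is a minimal sketch, not the authors' implementation. It assumes a greedy forward longest-matching segmenter over a small hypothetical dictionary and shows how word terms might be combined with single characters, alongside character bigrams for comparison. The function names, the `max_len` parameter, and the sample dictionary are all assumptions made for this example.

```python
def longest_match_segment(text, dictionary, max_len=4):
    """Greedy forward longest-matching segmentation (illustrative sketch).

    At each position, take the longest dictionary word starting there;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


def char_bigrams(text):
    """Overlapping character bigrams (n-grams with n = 2)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def words_plus_characters(text, dictionary, max_len=4):
    """Index terms: longest-matching words combined with single characters.

    Duplicates are dropped while preserving first-seen order.
    """
    terms = []
    seen = set()
    for term in longest_match_segment(text, dictionary, max_len) + list(text):
        if term not in seen:
            seen.add(term)
            terms.append(term)
    return terms


# Hypothetical dictionary and sample text, chosen only for this example.
SAMPLE_DICTIONARY = {"信息", "检索", "信息检索", "系统"}
SAMPLE_TEXT = "信息检索系统"

if __name__ == "__main__":
    print(longest_match_segment(SAMPLE_TEXT, SAMPLE_DICTIONARY))
    print(char_bigrams(SAMPLE_TEXT))
    print(words_plus_characters(SAMPLE_TEXT, SAMPLE_DICTIONARY))
```

In this sketch the segmenter prefers the four-character entry over its two-character parts, so a query containing only the shorter word would not match on word terms alone; adding the single characters is one way such partial matches can still be recovered.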
- On the use of words and n-grams for Chinese information retrieval