Abstract
This Tipster 3 project investigated both English and Chinese ad-hoc information retrieval. One of our objectives was to study the use of term-level and phrase-level evidence to improve retrieval accuracy. For short queries, we studied five term-level techniques that together give good improvements over standard ad-hoc 2-stage retrieval in the TREC5-8 experiments. For long queries, we studied the use of linguistic phrases to re-rank retrieval lists; the effect is small but consistently positive.
For Chinese IR, we investigated three simple representations for documents and queries: short-words, bigrams and characters. Both approximate short-word segmentation and bigrams, each augmented with characters, give highly effective results. Accurate word segmentation does not appear to be crucial for the overall result of a query set. Character indexing by itself is not competitive. Additional improvements may be obtained using collection enrichment and combination of retrieval lists.
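To make the bigram and character representations concrete, here is a minimal Python sketch of how a Chinese string might be turned into overlapping character bigrams augmented with single characters. The function names are hypothetical and this is not the PIRCS implementation; it also assumes the simplest possible treatment of text (whitespace is dropped, and bigrams are not broken at punctuation), which a real indexer would likely refine.

```python
def char_tokens(text):
    """Return every non-whitespace character as its own index term."""
    return [ch for ch in text if not ch.isspace()]


def bigram_tokens(text):
    """Return overlapping character bigrams (assumed simplification:
    no breaks at punctuation or at non-Chinese characters)."""
    chars = char_tokens(text)
    return [a + b for a, b in zip(chars, chars[1:])]


def bigram_plus_char_tokens(text):
    """Bigrams augmented with single characters, as one combined term list."""
    return bigram_tokens(text) + char_tokens(text)


if __name__ == "__main__":
    # "信息检索" means "information retrieval".
    print(bigram_plus_char_tokens("信息检索"))
```

The same function can be applied to both documents and queries so that they share one term space. Short-word segmentation is not sketched here because it depends on a lexicon and segmentation rules that the abstract does not describe.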
Our PIRCS document-focused retrieval is also shown to be similar to a simple language model approach to IR.
Cite this article
Kwok, K. Improving English and Chinese Ad-Hoc Retrieval: A Tipster Text Phase 3 Project Report. Information Retrieval 3, 313–338 (2000). https://doi.org/10.1023/A:1009955715597