Abstract
Evaluating document-to-document relevance is important to many advanced applications such as information retrieval (IR), text mining, and natural language processing. Because users' uncertainty makes document relevance very hard to define mathematically, the concept of topical relevance is widely accepted in most research fields. It suggests that a document relevance model should explain whether the document representation describes the document's topical content and whether the matching method reveals the topical differences among documents. However, current document-to-document relevance models, such as the vector space model and string distance, do not explicitly emphasize the perspective of topical relevance. This paper uses a document language model to represent a document's topical content, explains why the model can reveal document topics, and then establishes two distributional similarity measures based on the document language model to evaluate document-to-document relevance. An experiment on the TREC test collection compares the approach with the vector space model, and the results show that the Kullback-Leibler divergence measure with Jelinek-Mercer smoothing significantly outperforms the vector space model.
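The abstract does not give the paper's exact formulation, so the following is only a minimal sketch of how a Kullback-Leibler divergence between Jelinek-Mercer smoothed document language models might be computed. It assumes unigram models, whitespace tokenization, a collection built from the documents themselves, and an interpolation weight `lam` of 0.5; the function names and the toy documents are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def unigram_counts(text):
    """Count lowercase whitespace-separated tokens (assumed tokenization)."""
    return Counter(text.lower().split())


def jm_model(doc_counts, coll_counts, lam=0.5):
    """Jelinek-Mercer smoothed unigram model over the collection vocabulary.

    P(w|d) = (1 - lam) * P_ml(w|d) + lam * P(w|C)
    where P_ml is the maximum-likelihood estimate from the document and
    P(w|C) is the collection model. The value of lam is an assumption.
    """
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    return {
        w: (1 - lam) * doc_counts.get(w, 0) / doc_len
        + lam * coll_counts[w] / coll_len
        for w in coll_counts
    }


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


# Toy collection (hypothetical data, for illustration only).
docs = [
    "language model retrieval smoothing",
    "language model retrieval ranking",
    "protein folding structure biology",
]
counts = [unigram_counts(d) for d in docs]
collection = sum(counts, Counter())
models = [jm_model(c, collection) for c in counts]

# A lower divergence suggests the two documents are topically closer.
kl_near = kl_divergence(models[0], models[1])
kl_far = kl_divergence(models[0], models[2])
print(kl_near < kl_far)
```

Smoothing with the collection model keeps every vocabulary word at a nonzero probability, so the logarithm in the divergence is always defined. KL divergence is asymmetric, so the direction of comparison is a design choice that this sketch fixes arbitrarily.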
Supported by the National Natural Science Foundation of China under Grant No. 60173051 and by the Teaching and Research Award Program for Outstanding Young Teachers in Higher Education Institutions of the Ministry of Education, China.
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Yu, G., Li, X., Bao, Y., Wang, D. (2005). Evaluating Document-to-Document Relevance Based on Document Language Model: Modeling, Implementation and Performance Evaluation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_63
DOI: https://doi.org/10.1007/978-3-540-30586-6_63
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)