Abstract
Probabilistic latent semantic analysis (PLSA) is a method of calculating term relationships within a document set using term frequencies. It is well known within the information retrieval community that raw term frequencies contain various biases that affect the precision of a retrieval system. Weighting schemes, such as BM25, have been developed to remove these biases and hence improve the overall quality of the results. We hypothesised that the biases found within raw term frequencies also affect the calculation of term relationships performed during PLSA. Using portions of the BM25 probabilistic weighting scheme, we show that applying weights to the raw term frequencies before performing PLSA leads to a significant increase in precision at 10 documents and average reciprocal rank. When the BM25 weighted PLSA information was used in the form of a thesaurus, we achieved an average 8% increase in precision. Our thesaurus method was also compared to pseudo-relevance feedback and a co-occurrence thesaurus, both using BM25 weights. The probabilistic latent semantic thesaurus using BM25 weights outperformed both methods in terms of precision at 10 documents and average reciprocal rank.
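As a rough illustration of the idea, the sketch below applies the standard BM25 term-frequency saturation to a document-by-term count matrix, which could then be passed to PLSA in place of the raw counts, and computes the two reported measures for a single query. The abstract does not state which portions of BM25 were used, so the choice of this component, the parameter values `k1 = 1.2` and `b = 0.75`, and all function names are assumptions made for illustration only.

```python
from typing import List, Sequence, Set


def bm25_tf_weights(counts: List[List[float]],
                    k1: float = 1.2, b: float = 0.75) -> List[List[float]]:
    """Replace raw counts with BM25 term-frequency weights.

    counts[d][t] is the raw frequency of term t in document d.
    k1 and b are assumed defaults, not values taken from the paper.
    """
    if not counts:
        return []
    doc_lengths = [sum(row) for row in counts]
    avg_length = sum(doc_lengths) / len(doc_lengths)
    if avg_length == 0:
        # Every document is empty, so every weight is zero.
        return [[0.0 for _ in row] for row in counts]
    weighted = []
    for row, length in zip(counts, doc_lengths):
        # Document-length normalisation term of BM25.
        norm = k1 * (1.0 - b + b * length / avg_length)
        weighted.append(
            [tf * (k1 + 1.0) / (tf + norm) if tf > 0 else 0.0 for tf in row]
        )
    return weighted


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```

Averaging `reciprocal_rank` over a query set gives the average reciprocal rank; the weighted matrix is only a drop-in input, and the PLSA fitting itself is not shown here.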
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Park, L.A.F., Ramamohanarao, K. (2008). The Effect of Weighted Term Frequencies on Probabilistic Latent Semantic Term Relationships. In: Amir, A., Turpin, A., Moffat, A. (eds) String Processing and Information Retrieval. SPIRE 2008. Lecture Notes in Computer Science, vol 5280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89097-3_8
DOI: https://doi.org/10.1007/978-3-540-89097-3_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89096-6
Online ISBN: 978-3-540-89097-3
eBook Packages: Computer Science (R0)