The Effect of Weighted Term Frequencies on Probabilistic Latent Semantic Term Relationships

  • Conference paper
String Processing and Information Retrieval (SPIRE 2008)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5280)

Abstract

Probabilistic latent semantic analysis (PLSA) is a method of calculating term relationships within a document set using term frequencies. It is well known within the information retrieval community that raw term frequencies contain various biases that reduce the precision of a retrieval system. Weighting schemes such as BM25 were developed to remove these biases and so improve the overall quality of the retrieval results. We hypothesised that the biases found within raw term frequencies also affect the term relationships calculated by PLSA. Using portions of the BM25 probabilistic weighting scheme, we show that applying weights to the raw term frequencies before performing PLSA leads to a significant increase in precision at 10 documents and in average reciprocal rank. When the BM25-weighted PLSA information was used in the form of a thesaurus, we achieved an average 8% increase in precision. We also compared our thesaurus method with pseudo-relevance feedback and a co-occurrence thesaurus, both using BM25 weights. The probabilistic latent semantic thesaurus using BM25 weights outperformed both methods in terms of precision at 10 documents and average reciprocal rank.
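The pipeline described in the abstract (weight the raw term frequencies, then run PLSA, then read off term relationships) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which portions of BM25 were used, so the sketch assumes the usual term-frequency saturation with document-length normalisation and the common defaults `k1=1.2` and `b=0.75`. The function names, the toy count matrix, and the formula P(w_j | w_i) = Σ_z P(w_j | z) P(z | w_i) for term relationships are likewise assumptions made for the example. Running EM on real-valued weights instead of integer counts works mechanically, but the fit is then no longer a strict multinomial likelihood.

```python
import random

def bm25_tf_weights(counts, k1=1.2, b=0.75):
    """Assumed BM25 portion: TF saturation with document-length normalisation."""
    doc_lens = [sum(row) for row in counts]
    avg_len = sum(doc_lens) / len(doc_lens)
    weighted = []
    for row, dl in zip(counts, doc_lens):
        norm = k1 * (1.0 - b + b * dl / avg_len)
        weighted.append([tf * (k1 + 1.0) / (tf + norm) if tf > 0 else 0.0
                         for tf in row])
    return weighted

def normalise(vec):
    """Scale a non-negative vector to sum to 1 (uniform if it is all zeros)."""
    total = sum(vec)
    return [v / total for v in vec] if total > 0 else [1.0 / len(vec)] * len(vec)

def plsa(weights, n_topics=2, n_iter=50, seed=0):
    """Plain EM for PLSA on a document-by-term matrix of (weighted) frequencies."""
    rng = random.Random(seed)
    n_docs, n_terms = len(weights), len(weights[0])
    p_z = [1.0 / n_topics] * n_topics
    p_d_z = [normalise([rng.random() for _ in range(n_docs)]) for _ in range(n_topics)]
    p_w_z = [normalise([rng.random() for _ in range(n_terms)]) for _ in range(n_topics)]
    for _ in range(n_iter):
        new_z = [0.0] * n_topics
        new_d = [[0.0] * n_docs for _ in range(n_topics)]
        new_w = [[0.0] * n_terms for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_terms):
                n = weights[d][w]
                if n == 0.0:
                    continue
                # E-step: posterior P(z | d, w) for this document-term pair.
                post = normalise([p_z[z] * p_d_z[z][d] * p_w_z[z][w]
                                  for z in range(n_topics)])
                # Accumulate expected weighted counts for the M-step.
                for z in range(n_topics):
                    r = n * post[z]
                    new_z[z] += r
                    new_d[z][d] += r
                    new_w[z][w] += r
        # M-step: renormalise the accumulated counts into probabilities.
        p_z = normalise(new_z)
        p_d_z = [normalise(row) for row in new_d]
        p_w_z = [normalise(row) for row in new_w]
    return p_z, p_d_z, p_w_z

def term_given_term(p_z, p_w_z, i):
    """P(w_j | w_i) = sum_z P(w_j | z) P(z | w_i), for every term j."""
    p_z_wi = normalise([p_z[z] * p_w_z[z][i] for z in range(len(p_z))])
    return [sum(p_w_z[z][j] * p_z_wi[z] for z in range(len(p_z)))
            for j in range(len(p_w_z[0]))]

# Toy corpus (hypothetical): rows are documents, columns are terms.
counts = [[3, 2, 0, 0],
          [2, 3, 1, 0],
          [0, 0, 4, 2],
          [0, 1, 2, 3]]
weighted = bm25_tf_weights(counts)
p_z, p_d_z, p_w_z = plsa(weighted)
related = term_given_term(p_z, p_w_z, 0)  # relationship of every term to term 0
```

Here `related` is a probability distribution over the vocabulary. In a thesaurus setting, the highest-scoring entries would be the candidate expansion terms for term 0. Passing `counts` instead of `weighted` to `plsa` gives the unweighted baseline for comparison.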

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Park, L.A.F., Ramamohanarao, K. (2008). The Effect of Weighted Term Frequencies on Probabilistic Latent Semantic Term Relationships. In: Amir, A., Turpin, A., Moffat, A. (eds) String Processing and Information Retrieval. SPIRE 2008. Lecture Notes in Computer Science, vol 5280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89097-3_8

  • DOI: https://doi.org/10.1007/978-3-540-89097-3_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-89096-6

  • Online ISBN: 978-3-540-89097-3

  • eBook Packages: Computer Science, Computer Science (R0)
