Efficient storage and retrieval of probabilistic latent semantic information for information retrieval

  • Regular Paper
  • The VLDB Journal

Abstract

Probabilistic latent semantic analysis (PLSA) is a method for computing term and document relationships from a document set. The probabilistic latent semantic index (PLSI) has been used to store PLSA information, but the PLSI uses excessive storage space relative to a simple term frequency index, which causes lengthy query times. To overcome the storage and speed problems of PLSI, we introduce the probabilistic latent semantic thesaurus (PLST), an efficient and effective method of storing the PLSA information. We show that, through methods such as document thresholding and term pruning, we are able to maintain the high-precision results found using PLSA while using only a very small fraction (0.15%) of the storage space of PLSI.
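
The abstract does not spell out how term pruning is carried out, so the following is only a generic illustration of why pruning can shrink a thesaurus-style structure. It assumes a hypothetical term-to-term weight table (`dense`), a hypothetical cutoff (`threshold`), and hypothetical helper names (`prune_thesaurus`, `stored_entries`); none of these come from the paper, and the sketch should not be read as the authors' procedure.

```python
def prune_thesaurus(related, threshold):
    """Drop term-to-term weights below `threshold`, then renormalise each row.

    `related` maps a term to a dict of {related_term: weight}. This layout is
    an assumption made for illustration only.
    """
    pruned = {}
    for term, weights in related.items():
        kept = {t: w for t, w in weights.items() if w >= threshold}
        total = sum(kept.values())
        if total > 0:
            # Rescale the surviving weights so that each row sums to 1 again.
            pruned[term] = {t: w / total for t, w in kept.items()}
    return pruned


def stored_entries(related):
    """Count the term-to-term weights that would need to be stored."""
    return sum(len(weights) for weights in related.values())


# Hypothetical toy weights, invented for this example.
dense = {
    "car": {"car": 0.70, "auto": 0.20, "road": 0.08, "tree": 0.02},
    "auto": {"auto": 0.60, "car": 0.30, "road": 0.07, "tree": 0.03},
    "tree": {"tree": 0.83, "road": 0.12, "car": 0.03, "auto": 0.02},
}

sparse = prune_thesaurus(dense, threshold=0.1)
print(stored_entries(dense), "entries before pruning")
print(stored_entries(sparse), "entries after pruning")
```

In this toy table each term keeps only its two strongest relationships, so half of the stored weights disappear. How much precision survives such pruning on a real collection is an empirical question that the toy example does not address.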

Author information

Corresponding author

Correspondence to Laurence A. F. Park.

About this article

Cite this article

Park, L.A.F., Ramamohanarao, K. Efficient storage and retrieval of probabilistic latent semantic information for information retrieval. The VLDB Journal 18, 141–155 (2009). https://doi.org/10.1007/s00778-008-0093-2

  • DOI: https://doi.org/10.1007/s00778-008-0093-2
