Efficient storage and retrieval of probabilistic latent semantic information for information retrieval

Park, Laurence A. F.; Ramamohanarao, Kotagiri

doi:10.1007/s00778-008-0093-2

Regular Paper
Published: 28 February 2008

The VLDB Journal, volume 18, pages 141–155 (2009)


Abstract

Probabilistic latent semantic analysis (PLSA) is a method for computing term and document relationships from a document set. The probabilistic latent semantic index (PLSI) has been used to store PLSA information, but the PLSI requires far more storage space than a simple term frequency index, which leads to lengthy query times. To overcome the storage and speed problems of PLSI, we introduce the probabilistic latent semantic thesaurus (PLST), an efficient and effective method of storing the PLSA information. We show that, through methods such as document thresholding and term pruning, we are able to maintain the high precision found using PLSA while using only a very small fraction (0.15%) of the storage space of PLSI.
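The abstract names term pruning as one way of shrinking the stored relationships. The following is a minimal sketch of that general idea only, not the construction used in the article: it assumes a thesaurus held as a mapping from each term to weighted related terms, keeps the highest-weighted entries per term, and expands a query from what remains. The function names, the `keep` parameter and the toy weights are all hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: term -> {related term: relationship weight}.
Thesaurus = Dict[str, Dict[str, float]]


def prune_thesaurus(dense: Thesaurus, keep: int) -> Thesaurus:
    """Keep only the `keep` highest-weighted related terms for each term."""
    pruned: Thesaurus = {}
    for term, related in dense.items():
        top = sorted(related.items(), key=lambda kv: kv[1], reverse=True)[:keep]
        pruned[term] = dict(top)
    return pruned


def expand_query(query: List[str], thesaurus: Thesaurus) -> Dict[str, float]:
    """Sum the stored relationship weights contributed by each query term."""
    weights: Dict[str, float] = {}
    for term in query:
        for related, weight in thesaurus.get(term, {}).items():
            weights[related] = weights.get(related, 0.0) + weight
    return weights


def stored_entries(thesaurus: Thesaurus) -> int:
    """Count the (term, related term) pairs that must be stored."""
    return sum(len(related) for related in thesaurus.values())


# Toy weights, invented purely for illustration.
dense = {
    "car": {"car": 0.6, "auto": 0.3, "engine": 0.08, "apple": 0.02},
    "auto": {"auto": 0.5, "car": 0.4, "engine": 0.07, "apple": 0.03},
    "apple": {"apple": 0.8, "fruit": 0.15, "car": 0.03, "auto": 0.02},
}

sparse = prune_thesaurus(dense, keep=2)
expanded = expand_query(["car"], sparse)

print(stored_entries(dense), stored_entries(sparse))
print(expanded)
```

In this toy case pruning halves the number of stored pairs while the query for "car" still reaches "auto". The storage and precision figures reported in the abstract come from the article's own experiments and are not reproduced by this sketch.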



Author information

Authors and Affiliations

Department of Computer Science and Software Engineering, ARC Centre for Perceptive and Intelligent Machines in Complex Environments, The University of Melbourne, Melbourne, Australia
Laurence A. F. Park & Kotagiri Ramamohanarao


Corresponding author

Correspondence to Laurence A. F. Park.

About this article

Cite this article

Park, L.A.F., Ramamohanarao, K. Efficient storage and retrieval of probabilistic latent semantic information for information retrieval. The VLDB Journal 18, 141–155 (2009). https://doi.org/10.1007/s00778-008-0093-2

Received: 05 March 2007
Revised: 23 December 2007
Accepted: 02 January 2008
Published: 28 February 2008
Issue Date: January 2009
DOI: https://doi.org/10.1007/s00778-008-0093-2
