Abstract
Probabilistic latent semantic analysis (PLSA) is a method for computing term and document relationships from a document set. The probabilistic latent semantic index (PLSI) has been used to store PLSA information, but the PLSI uses excessive storage space relative to a simple term frequency index, which causes lengthy query times. To overcome the storage and speed problems of PLSI, we introduce the probabilistic latent semantic thesaurus (PLST), an efficient and effective method of storing the PLSA information. We show that, through methods such as document thresholding and term pruning, we are able to maintain the high-precision results found using PLSA while using a very small fraction (0.15%) of the storage space of PLSI.
Cite this article
Park, L.A.F., Ramamohanarao, K. Efficient storage and retrieval of probabilistic latent semantic information for information retrieval. The VLDB Journal 18, 141–155 (2009). https://doi.org/10.1007/s00778-008-0093-2
DOI: https://doi.org/10.1007/s00778-008-0093-2