Large-scale SVD and subspace-based methods for information retrieval

  • Regular Talks
  • Conference paper
Solving Irregularly Structured Problems in Parallel (IRREGULAR 1998)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1457)

Abstract

A theoretical foundation for latent semantic indexing (LSI) is proposed by adapting a model first used in array signal processing to the context of information retrieval using the concept of subspaces. It is shown that this subspace-based model, coupled with the minimal description length (MDL) principle, leads to a statistical test to determine the dimensions of the latent-concept subspaces in LSI. The effect of weighting on the choice of the optimal dimensions of latent-concept subspaces is illustrated. It is also shown that the model imposes a so-called low-rank-plus-shift structure that is approximately satisfied by the cross-product of the term-document matrices. This structure can be exploited to give a more accurate updating scheme for LSI and to correct some of the misconceptions about the achievable retrieval accuracy in LSI updating. Variants of Lanczos algorithms are illustrated with numerical test results on a Cray T3E using document collections generated from the World Wide Web.
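As a rough illustration of the subspace idea behind LSI, the sketch below projects a query and a handful of documents onto the leading singular subspace of a toy term-document matrix and compares them by cosine similarity. Everything in it is an assumption made for illustration: the five-term, four-document matrix, the helper names, and the choice k = 2 (picked by hand, not by the MDL test). Plain power iteration with deflation stands in for the Lanczos variants discussed in the paper, which would be the practical choice at scale.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(M, x):
    return [dot(row, x) for row in M]

def cosine(x, y):
    nx, ny = math.sqrt(dot(x, x)), math.sqrt(dot(y, y))
    return dot(x, y) / (nx * ny) if nx > 0 and ny > 0 else 0.0

def leading_triplets(A, k, iters=500):
    """Leading k singular triplets (sigma, u, v) of A.

    Uses power iteration on A^T A with deflation; a simple stand-in
    for a Lanczos solver, adequate only for tiny dense examples.
    """
    At = [list(col) for col in zip(*A)]
    n = len(At)  # number of documents (columns of A)
    found = []
    for j in range(k):
        v = [1.0 / (i + j + 1) for i in range(n)]  # deterministic start
        for _ in range(iters):
            # Remove components along right singular vectors already found.
            for _, _, vp in found:
                c = dot(v, vp)
                v = [a - c * b for a, b in zip(v, vp)]
            w = matvec(At, matvec(A, v))
            nw = math.sqrt(dot(w, w))
            v = [a / nw for a in w]
        u = matvec(A, v)
        sigma = math.sqrt(dot(u, u))
        found.append((sigma, [a / sigma for a in u], v))
    return found

# Hypothetical toy collection. Rows are terms, columns are documents.
# Terms: car, automobile, engine, flower, petal
A = [
    [1.0, 0.0, 0.0, 0.0],  # car
    [0.0, 1.0, 0.0, 0.0],  # automobile
    [1.0, 1.0, 0.0, 0.0],  # engine
    [0.0, 0.0, 1.0, 1.0],  # flower
    [0.0, 0.0, 1.0, 0.0],  # petal
]
query = [1.0, 0.0, 0.0, 0.0, 0.0]  # the single term "car"

docs = [list(col) for col in zip(*A)]
raw = [cosine(query, d) for d in docs]  # plain term-space matching

trip = leading_triplets(A, k=2)
q_hat = [dot(u, query) for _, u, _ in trip]  # query in the k-dim subspace
lsi = []
for j in range(len(docs)):
    d_hat = [s * v[j] for s, _, v in trip]  # document j in the same subspace
    lsi.append(cosine(q_hat, d_hat))

print("term-space cosines:", [round(x, 3) for x in raw])
print("LSI (k=2) cosines: ", [round(x, 3) for x in lsi])
```

In this toy collection the second document contains "automobile" and "engine" but not "car", so term-space matching scores it zero against the query, while the rank-2 subspace places it alongside the first document. How many dimensions to keep is exactly the question the MDL-based test in the paper addresses; the value used here is not derived from it.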

This work was supported by the Director, Office of Energy Research, Office of Laboratory Policy and Infrastructure Management, of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098 and NSF grant CCR-9619452.

Author information

Corresponding author

Correspondence to Hongyuan Zha.

Editor information

Alfonso Ferreira, José Rolim, Horst Simon, Shang-Hua Teng

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zha, H., Marques, O., Simon, H.D. (1998). Large-scale SVD and subspace-based methods for information retrieval. In: Ferreira, A., Rolim, J., Simon, H., Teng, SH. (eds) Solving Irregularly Structured Problems in Parallel. IRREGULAR 1998. Lecture Notes in Computer Science, vol 1457. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0018525

  • DOI: https://doi.org/10.1007/BFb0018525

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-64809-3

  • Online ISBN: 978-3-540-68533-3

  • eBook Packages: Springer Book Archive
