Abstract
Text collections represented in the LSI model are hard to search efficiently (i.e., quickly), since no indexing method exists for LSI matrices. The inverted file, commonly used in both the Boolean and the classic vector model, cannot be utilized effectively, because query vectors in the LSI model are dense. One possible route to efficient search in LSI matrices is the use of metric access methods (MAMs). Instead of the cosine measure, MAMs can use the deviation metric as an equivalent dissimilarity measure for query processing. However, the intrinsic dimensionality of collections represented by LSI matrices is often high, which degrades the search performance of MAMs. In this paper we introduce σ-LSI, a modification of LSI in which we artificially decrease the intrinsic dimensionality of LSI matrices. This is achieved by adjusting the singular values produced by SVD. We show that suitable adjustments can dramatically improve the efficiency of search by MAMs, while the precision/recall values are preserved or become only slightly worse.
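To make the relation between the cosine measure and the deviation metric concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the deviation metric is the angle between two vectors (the arccosine of the cosine measure), and it uses one common LSI convention in which documents and queries are both projected by the transposed left singular vectors. The function names `lsi_project`, `cosine_similarity`, and `deviation` are hypothetical, and the singular-value adjustment of σ-LSI is deliberately not shown, since the abstract does not specify it.

```python
import numpy as np

def lsi_project(term_doc, k):
    """Rank-k LSI of a term-by-document matrix (assumed convention).

    Returns the document vectors in concept space (one row per document)
    and a function that projects a term-space query into the same space.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # U_k^T A = diag(s_k) V_k^T, so each column is one document; transpose to rows.
    doc_vectors = (np.diag(s_k) @ Vt_k).T

    def project_query(q):
        # Queries are projected the same way as documents: U_k^T q.
        return U_k.T @ q

    return doc_vectors, project_query

def cosine_similarity(x, y):
    """Cosine measure between two non-zero vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def deviation(x, y):
    """Angle between x and y in radians (assumed form of the deviation metric)."""
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(cosine_similarity(x, y), -1.0, 1.0)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((6, 5))              # 6 terms, 5 documents (toy data)
    docs, project = lsi_project(A, k=3)
    q = project(rng.random(6))          # a dense query in concept space
    by_cosine = np.argsort([-cosine_similarity(d, q) for d in docs], kind="stable")
    by_deviation = np.argsort([deviation(d, q) for d in docs], kind="stable")
    print(by_cosine, by_deviation)      # the two rankings coincide
```

Because the arccosine is strictly decreasing on [-1, 1], ordering documents by increasing angle gives the same ranking as ordering them by decreasing cosine, while the angle, unlike the cosine measure, satisfies the triangle inequality that metric access methods rely on.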
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Skopal, T., Moravec, P. (2005). Modified LSI Model for Efficient Search by Metric Access Methods. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_18
DOI: https://doi.org/10.1007/978-3-540-31865-1_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25295-5
Online ISBN: 978-3-540-31865-1
eBook Packages: Computer Science (R0)