ABSTRACT
We consider the problem of document indexing and representation. Recently, Locality Preserving Indexing (LPI) was proposed for learning a compact document subspace. Unlike Latent Semantic Indexing (LSI), which is optimal in the sense of global Euclidean structure, LPI is optimal in the sense of local manifold structure. However, LPI is extremely sensitive to the number of dimensions. This makes the intrinsic dimensionality difficult to estimate, and an inaccurate estimate drastically degrades its performance. One reason for this problem is that LPI is non-orthogonal, and non-orthogonality distorts the metric structure of the document space. In this paper, we propose a new algorithm called Orthogonal LPI (OLPI). OLPI iteratively computes mutually orthogonal basis functions that respect the local geometrical structure. Moreover, our empirical study shows that OLPI can have more locality preserving power than LPI. We compare the new algorithm to LSI and LPI. Extensive experimental results show that OLPI obtains better performance than both LSI and LPI. More crucially, it is insensitive to the number of dimensions, which makes it an efficient data preprocessing method for text clustering, classification, retrieval, etc.
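The claim that non-orthogonality distorts the metric structure can be illustrated with a small numerical sketch. This is not the OLPI algorithm itself, and for simplicity it does not reduce dimensionality: it only compares pairwise distances after mapping the same points through a non-orthogonal basis and through an orthonormal one. The toy data, the random basis `A`, and the helper `pairwise_sq_dists` are assumptions made for illustration.

```python
import numpy as np

def pairwise_sq_dists(Y):
    """Squared Euclidean distances between all pairs of rows of Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sum(diff ** 2, axis=-1)

rng = np.random.default_rng(0)

# Hypothetical toy data: 6 "documents" described by 4 features each.
X = rng.normal(size=(6, 4))

# A non-orthogonal basis: a random square matrix whose columns are
# neither unit length nor mutually perpendicular.
A = rng.normal(size=(4, 4))

# An orthonormal basis for the same space, obtained from A by QR.
Q, _ = np.linalg.qr(A)

d_orig = pairwise_sq_dists(X)
d_nonorth = pairwise_sq_dists(X @ A)   # coordinates under the non-orthogonal basis
d_orth = pairwise_sq_dists(X @ Q)      # coordinates under the orthonormal basis

# The orthonormal basis leaves every pairwise distance unchanged,
# while the non-orthogonal one rescales and skews them.
orth_preserves = bool(np.allclose(d_orig, d_orth))
nonorth_preserves = bool(np.allclose(d_orig, d_nonorth))
```

Under these assumptions the distances computed with `Q` match the original ones and those computed with `A` do not. This is the sense in which a non-orthogonal basis changes which documents appear close to each other.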