ABSTRACT
Document representation and indexing is a key problem in document analysis and processing tasks such as clustering, classification, and retrieval. Conventionally, Latent Semantic Indexing (LSI) is considered effective for deriving such an index. However, LSI essentially detects the most representative features for document representation rather than the most discriminative ones, so it might not be optimal for discriminating between documents with different semantics. In this paper, a novel algorithm called Locality Preserving Indexing (LPI) is proposed for document indexing, in which each document is represented by a low-dimensional vector. In contrast to LSI, which discovers the global structure of the document space, LPI discovers the local structure and obtains a compact document representation subspace that best detects the essential semantic structure. We compare the proposed LPI approach with LSI on two standard databases. Experimental results show that LPI provides a better representation in the sense of semantic structure.
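The abstract does not spell out the computational steps, so the following is only a minimal sketch of the contrast it draws, assuming the commonly described locality preserving projection formulation: LSI keeps the top singular directions of the document-term matrix (global structure), while LPI builds a nearest-neighbor graph over documents and solves a generalized eigenproblem on the graph Laplacian (local structure). The function names `lsi`, `knn_graph`, and `lpi`, the cosine similarity, the binary edge weights, the `n_neighbors` parameter, and the toy matrix `X_toy` are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def lsi(X, k):
    """Global view: project documents (rows of X) onto the top-k singular directions."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def knn_graph(X, n_neighbors):
    """Symmetric binary k-nearest-neighbor graph under cosine similarity (assumed weighting)."""
    n = X.shape[0]
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)              # a document is never its own neighbor
    idx = np.argsort(-S, axis=1)[:, :n_neighbors]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), n_neighbors), idx.ravel()] = 1.0
    return np.maximum(W, W.T)                 # keep an edge if either endpoint selects it

def lpi(X, k, n_neighbors=2):
    """Local view: find directions that keep graph neighbors close after projection."""
    # Step 1: SVD projection, dropping null directions so the matrices below are non-singular.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > 1e-10 * s[0]))
    Z = U[:, :r] * s[:r]
    # Step 2: neighborhood graph, degree matrix D, and graph Laplacian L = D - W.
    W = knn_graph(X, n_neighbors)
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Step 3: solve A a = lambda B a for the smallest eigenvalues by whitening with Cholesky.
    A = Z.T @ L @ Z
    B = Z.T @ D @ Z
    C = np.linalg.cholesky(B + 1e-9 * np.eye(r))
    C_inv = np.linalg.inv(C)
    M = C_inv @ A @ C_inv.T
    _, vecs = np.linalg.eigh((M + M.T) / 2)   # eigh returns eigenvalues in ascending order
    P = C_inv.T @ vecs[:, :k]
    return Z @ P

# Toy document-term counts: two loose topics sharing one term (made-up data).
X_toy = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [2, 2, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 1, 1, 2],
    [0, 0, 0, 2, 2],
], dtype=float)

Y_lsi = lsi(X_toy, 2)
Y_lpi = lpi(X_toy, 2, n_neighbors=2)
```

Both functions return one low-dimensional row per document, so the two embeddings can be passed to the same clustering or retrieval code for comparison. The sketch uses dense matrices for readability; a real document collection would need sparse matrices and a sparse eigensolver.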