ABSTRACT
Topic modeling can boost the performance of information retrieval, but its real-world application is limited by scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most solutions require drastic steps such as vastly reducing the input vocabulary. We introduce Regularized Latent Semantic Indexing (RLSI), a new method designed for parallelization. It is as effective as existing topic models and scales to larger datasets without reducing the input vocabulary. RLSI formalizes topic modeling as minimization of a quadratic loss function regularized by the l₂ and/or l₁ norm. This formulation allows the learning process to be decomposed into multiple sub-optimization problems that can be solved in parallel, for example via MapReduce. In particular, we propose adopting the l₁ norm on topics and the l₂ norm on document representations, yielding a model with compact, readable topics that is also useful for retrieval. Relevance ranking experiments on three TREC datasets show that RLSI performs better than LSI, PLSI, and LDA, and the improvements are sometimes statistically significant. Experiments on a web dataset containing about 1.6 million documents and 7 million terms demonstrate a similar boost in performance on a larger corpus and vocabulary than in previous studies.
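The abstract describes the optimization only at a high level. One plausible form of the objective, consistent with the description above and written in our own notation (D the M×N term-document matrix, U the M×K term-topic matrix with topic columns u_k, V the K×N topic-document matrix with document columns v_n), is to minimize ‖D − UV‖²_F + λ₁ Σ_k ‖u_k‖₁ + λ₂ Σ_n ‖v_n‖₂². The following NumPy sketch is a hypothetical illustration of that kind of formulation, not the authors' implementation; it shows why the learning decomposes into independent sub-problems: with U fixed, each document column of V is a ridge regression with a closed-form solution, and with V fixed, each term row of U is an l₁-regularized least-squares problem solvable by coordinate descent. In a MapReduce setting, each map task would solve a batch of these independent columns or rows.

```python
# Hypothetical sketch of an RLSI-style alternating optimization (assumed objective:
# ||D - UV||_F^2 + lambda1 * sum_k ||u_k||_1 + lambda2 * sum_n ||v_n||_2^2).
# Matrix names D, U, V and parameters lambda1, lambda2 are our own notation.
import numpy as np


def update_docs(D, U, lambda2):
    """V-step: with U fixed, each document column v_n is an independent ridge
    regression with closed form (U^T U + lambda2 I)^{-1} U^T d_n; columns can be
    solved in parallel (e.g. one map task per batch of documents)."""
    K = U.shape[1]
    S = np.linalg.inv(U.T @ U + lambda2 * np.eye(K))   # shared K x K inverse
    return S @ (U.T @ D)                                # all columns at once


def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def update_topics(D, U, V, lambda1, n_inner=5):
    """U-step: with V fixed, each term row of U is an independent l1-regularized
    least-squares problem; here solved by coordinate descent, vectorized over all
    term rows, which could likewise be distributed across workers."""
    Sigma = V @ V.T          # K x K
    R = D @ V.T              # M x K
    U = U.copy()
    for _ in range(n_inner):
        for k in range(U.shape[1]):
            # partial residual correlation for coordinate k, for every term row
            r_k = R[:, k] - U @ Sigma[:, k] + U[:, k] * Sigma[k, k]
            # soft threshold by lambda1/2 because the loss carries no 1/2 factor
            U[:, k] = soft_threshold(r_k, lambda1 / 2) / (Sigma[k, k] + 1e-12)
    return U


def rlsi(D, n_topics=50, lambda1=0.1, lambda2=0.1, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    M, N = D.shape
    U = rng.standard_normal((M, n_topics)) * 0.01
    for _ in range(n_iter):
        V = update_docs(D, U, lambda2)        # parallel over documents
        U = update_topics(D, U, V, lambda1)   # parallel over terms
    return U, V
```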
REFERENCES
- L. AlSumait, D. Barbara, and C. Domeniconi. On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. In ICDM, 2008.
- A. Asuncion, P. Smyth, and M. Welling. Asynchronous distributed estimation of topic models for document analysis. Statistical Methodology, 2011.
- D. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993--1022, 2003.
- A. Buluc and J. R. Gilbert. Challenges and advances in parallel sparse matrix-matrix multiplication. In ICPP, pages 503--510, 2008.
- C. J. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In NIPS 19, 2007.
- R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. VLDB Endow., 1:1265--1276, 2008.
- S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SISC, 20:33--61, 1998.
- X. Chen, B. Bai, Y. Qi, Q. Lin, and J. Carbonell. Sparse latent semantic analysis. In NIPS Workshop, 2010.
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
- S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. J AM SOC INFORM SCI, 41:391--407, 1990.
- C. Ding, T. Li, and W. Peng. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. COMPUT STAT DATA AN, 52:3913--3927, 2008.
- B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. ANN STAT, 32:407--499, 2004.
- J. Friedman, T. Hastie, H. Hofling, and R. Tibshirani. Pathwise coordinate optimization. ANN APPL STAT, 1:302--332, 2007.
- W. J. Fu. Penalized regressions: The bridge versus the lasso. J COMPUT GRAPH STAT, 7:397--416, 1998.
- M. D. Hoffman, D. M. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In NIPS, 2010.
- T. Hofmann. Probabilistic latent semantic indexing. In SIGIR, pages 50--57, 1999.
- D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788--791, 1999.
- D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS 13, pages 556--562, 2001.
- H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In NIPS, pages 801--808, 2007.
- C. Liu, H.-C. Yang, J. Fan, L.-W. He, and Y.-M. Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In WWW, pages 681--690, 2010.
- Z. Liu, Y. Zhang, and E. Y. Chang. PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline processing. TIST, 2010.
- J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In NIPS 21, pages 1033--1040, 2009.
- D. M. Mimno and A. McCallum. Organizing the OCA: Learning faceted subjects from a library of digital books. In JCDL, pages 376--385, 2007.
- D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In NIPS, 2008.
- B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1. VISION RES, 37:3311--3325, 1997.
- M. Osborne, B. Presnell, and B. Turlach. A new approach to variable selection in least squares problems. IMA J NUMER ANAL, 2000.
- S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In TREC-3, 1994.
- R. Rubinstein, M. Zibulevsky, and M. Elad. Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE T SIGNAL PROCES, pages 1553--1564, 2008.
- G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18:613--620, 1975.
- A. P. Singh and G. J. Gordon. A unified view of matrix factorization models. In ECML PKDD, pages 358--373, 2008.
- A. Smola and S. Narayanamurthy. An architecture for parallel topic models. Proc. VLDB Endow., 3:703--710, 2010.
- R. Thakur and R. Rabenseifner. Optimization of collective communication operations in MPICH. INT J HIGH PERFORM C, 19:49--66, 2005.
- C. Wang and D. M. Blei. Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In NIPS, 2009.
- Y. Wang, H. Bai, M. Stanton, W.-Y. Chen, and E. Y. Chang. PLDA: Parallel latent Dirichlet allocation for large-scale applications. In AAIM, pages 301--314, 2009.
- X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In SIGIR, pages 178--185, 2006.
- F. Yan, N. Xu, and Y. A. Qi. Parallel inference for latent Dirichlet allocation on graphics processing units. In NIPS, pages 2134--2142, 2009.
Index Terms
- Regularized latent semantic indexing