DOI: 10.1145/2009916.2010008
research-article

Regularized latent semantic indexing

Published: 24 July 2011

ABSTRACT

Topic modeling can boost the performance of information retrieval, but its real-world application is limited by scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most solutions require drastic steps such as vastly reducing the input vocabulary. We introduce Regularized Latent Semantic Indexing (RLSI), a new method designed for parallelization. It is as effective as existing topic models, and it scales to larger datasets without reducing the input vocabulary. RLSI formalizes topic modeling as the problem of minimizing a quadratic loss function regularized by the ℓ₂ and/or ℓ₁ norm. This formulation allows the learning process to be decomposed into multiple sub-optimization problems that can be solved in parallel, for example via MapReduce. In particular, we propose adopting the ℓ₁ norm on topics and the ℓ₂ norm on document representations, which yields a model whose topics are compact and readable and which is useful for retrieval. Relevance ranking experiments on three TREC datasets show that RLSI performs better than LSI, PLSI, and LDA, and the improvements are sometimes statistically significant. Experiments on a web dataset containing about 1.6 million documents and 7 million terms demonstrate a similar boost in performance on a larger corpus and vocabulary than in previous studies.
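
To make the decomposition concrete, here is a minimal single-machine sketch of one variant of the regularized quadratic loss: an ℓ₁ penalty on topics (for sparse topics) and an ℓ₂ penalty on document representations. It is not the authors' implementation. The alternating scheme, the coordinate-descent update, and all names (`rlsi_sketch`, `update_U`, `update_V`, `lam1`, `lam2`) are assumptions chosen for illustration. With D the term-by-document matrix, U the term-by-topic matrix, and V the topic-by-document matrix, the sketch minimizes ‖D − UV‖² + lam1·Σ|U| + lam2·‖V‖². Given U, each column of V is an independent ridge problem; given V, each row of U is an independent lasso problem, which is the property that would make a parallel (for example MapReduce) version possible.

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding, the proximal operator of t * |x|."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def objective(D, U, V, lam1, lam2):
    """Quadratic loss + l1 penalty on topics U + squared l2 penalty on V."""
    return np.sum((D - U @ V) ** 2) + lam1 * np.abs(U).sum() + lam2 * np.sum(V ** 2)

def update_V(D, U, lam2):
    """Closed-form ridge solution; every document column is independent."""
    K = U.shape[1]
    return np.linalg.solve(U.T @ U + lam2 * np.eye(K), U.T @ D)

def update_U(D, U, V, lam1):
    """One coordinate-descent sweep; every term row is independent.

    Rows are processed in lockstep here only for vectorization; each row
    could equally be handled by a separate worker.
    """
    S = V @ V.T          # K x K, shared by all rows
    R = D @ V.T          # M x K, row m depends only on term m
    U = U.copy()
    for k in range(U.shape[1]):
        if S[k, k] <= 1e-12:       # topic k unused by any document
            U[:, k] = 0.0
            continue
        # Correlation with topic k after removing the other topics' contribution.
        W = R[:, k] - U @ S[:, k] + U[:, k] * S[k, k]
        U[:, k] = soft_threshold(W, lam1 / 2.0) / S[k, k]
    return U

def rlsi_sketch(D, K, lam1=0.1, lam2=0.1, iters=20, seed=0):
    """Alternate the two block updates and record the objective value."""
    rng = np.random.default_rng(seed)
    U = rng.random((D.shape[0], K))
    V = np.zeros((K, D.shape[1]))
    history = []
    for _ in range(iters):
        V = update_V(D, U, lam2)
        U = update_U(D, U, V, lam1)
        history.append(objective(D, U, V, lam1, lam2))
    return U, V, history
```

Calling `rlsi_sketch(D, K=5)` on a small dense matrix returns the two factors and the objective value after each iteration. Because each block update can only lower the objective, the recorded values should never increase. A web-scale version would need sparse storage and distributed computation of `S` and `R`, which this sketch does not attempt.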

    Published in

      SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
      July 2011
      1374 pages
      ISBN:9781450307574
      DOI:10.1145/2009916

      Copyright © 2011 ACM

      Publisher: Association for Computing Machinery, New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 792 of 3,983 submissions, 20%
