Dimensionality Reduction and Topic Modeling: From Latent Semantic Indexing to Latent Dirichlet Allocation and Beyond

Mining Text Data

Abstract

The bag-of-words representation commonly used in text analysis can be analyzed very efficiently and retains a great deal of useful information, but it is also troublesome because the same thought can be expressed using many different terms and a single term can have very different meanings. Dimension reduction can collapse together terms that have the same semantics, identify and disambiguate terms with multiple meanings, and provide a lower-dimensional representation of documents that reflects concepts instead of raw terms. In this chapter, we survey two influential forms of dimension reduction. Latent semantic indexing uses spectral decomposition to identify a lower-dimensional representation that maintains semantic properties of the documents. Topic modeling, including probabilistic latent semantic indexing and latent Dirichlet allocation, is a form of dimension reduction that uses a probabilistic model to find the co-occurrence patterns of terms that correspond to semantic topics in a collection of documents. We describe the basic technologies in detail and expose the underlying mechanisms. We also discuss recent advances that have made it possible to apply these techniques to very large and evolving text collections and to incorporate network structure or other contextual information.
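
As a rough illustration of the spectral idea behind latent semantic indexing, the sketch below projects four toy documents into a two-dimensional latent space. It is not taken from the chapter: the term-document counts are invented, the helper names (`top_eigenpairs`, `lsi_document_coords`) are hypothetical, and the decomposition is done with plain power iteration on the document Gram matrix only to keep the example dependency-free. In practice a library SVD routine would normally be used instead.

```python
import math

# Invented toy term-document counts (rows = terms, columns = documents).
TERMS = ["car", "automobile", "engine", "flower", "petal"]
A = [
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
]

def column(matrix, j):
    return [row[j] for row in matrix]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0

def top_eigenpairs(sym, k, iters=1000):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _component in range(k):
        v = [float(i + 1) for i in range(n)]  # fixed, asymmetric start vector
        for _step in range(iters):
            w = [dot(row, v) for row in work]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in work])  # Rayleigh quotient
        pairs.append((lam, v))
        # Deflate: remove the component just found before searching for the next.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs

def lsi_document_coords(matrix, k):
    """Rank-k document coordinates, i.e. the columns of Sigma_k V_k^T.

    Uses the identity A^T A = V Sigma^2 V^T, so the eigenpairs of the
    document Gram matrix give the right singular vectors and squared
    singular values of A.
    """
    n_docs = len(matrix[0])
    docs = [column(matrix, j) for j in range(n_docs)]
    gram = [[dot(di, dj) for dj in docs] for di in docs]
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(max(lam, 0.0)) * v[j] for lam, v in pairs]
            for j in range(n_docs)]

coords = lsi_document_coords(A, k=2)
raw_sim = cosine(column(A, 0), column(A, 1))   # "car engine" vs "automobile engine"
lsi_sim = cosine(coords[0], coords[1])         # same pair in the latent space
cross_sim = cosine(coords[0], coords[2])       # vehicle doc vs flower doc
print(f"raw={raw_sim:.3f} lsi={lsi_sim:.3f} cross={cross_sim:.3f}")
```

In this toy corpus the first two documents share only the term "engine", so their raw cosine similarity is 0.5. In the two-dimensional latent space they land on essentially the same direction, while the vehicle and flower documents stay orthogonal. This is the synonym-collapsing behaviour the abstract describes, shown here on data chosen to make the effect obvious.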

Author information

Correspondence to Steven P. Crain.

Copyright information

© 2012 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Crain, S.P., Zhou, K., Yang, SH., Zha, H. (2012). Dimensionality Reduction and Topic Modeling: From Latent Semantic Indexing to Latent Dirichlet Allocation and Beyond. In: Aggarwal, C., Zhai, C. (eds) Mining Text Data. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-3223-4_5

  • DOI: https://doi.org/10.1007/978-1-4614-3223-4_5

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4614-3222-7

  • Online ISBN: 978-1-4614-3223-4

  • eBook Packages: Computer Science (R0)