Dimensionality Reduction and Topic Modeling: From Latent Semantic Indexing to Latent Dirichlet Allocation and Beyond

Crain, Steven P.; Zhou, Ke; Yang, Shuang-Hong; Zha, Hongyuan

doi:10.1007/978-1-4614-3223-4_5

Abstract

The bag-of-words representation commonly used in text analysis can be analyzed very efficiently and retains a great deal of useful information, but it is also troublesome because the same thought can be expressed using many different terms and a single term can have very different meanings. Dimension reduction can collapse together terms that have the same semantics, identify and disambiguate terms with multiple meanings, and provide a lower-dimensional representation of documents that reflects concepts instead of raw terms. In this chapter, we survey two influential forms of dimension reduction. Latent semantic indexing uses spectral decomposition to identify a lower-dimensional representation that maintains semantic properties of the documents. Topic modeling, including probabilistic latent semantic indexing and latent Dirichlet allocation, is a form of dimension reduction that uses a probabilistic model to find the co-occurrence patterns of terms that correspond to semantic topics in a collection of documents. We describe the basic technologies in detail and expose the underlying mechanism. We also discuss recent advances that have made it possible to apply these techniques to very large and evolving text collections and to incorporate network structure or other contextual information.
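
As a rough illustration of the idea that spectral decomposition can bring together documents that express the same concept in different terms, the following is a minimal sketch of latent semantic indexing using a truncated singular value decomposition. It is not taken from the chapter: the toy vocabulary, the count matrix `A`, the helper names `cosine` and `lsi_document_vectors`, and the choice of two latent dimensions are all assumptions made for this example.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
# Hypothetical data: documents 0 and 1 share no terms, but document 2
# uses both vocabularies and so links them through co-occurrence.
terms = ["car", "engine", "automobile", "motor", "flower", "petal", "garden"]
A = np.array([
    [1, 0, 1, 0, 0],  # car
    [1, 0, 1, 0, 0],  # engine
    [0, 1, 1, 0, 0],  # automobile
    [0, 1, 1, 0, 0],  # motor
    [0, 0, 0, 1, 1],  # flower
    [0, 0, 0, 1, 1],  # petal
    [0, 0, 0, 0, 1],  # garden
], dtype=float)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def lsi_document_vectors(A, k):
    """Return one k-dimensional row per document from a rank-k SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular values; scale the right singular vectors.
    return (np.diag(s[:k]) @ Vt[:k]).T


docs_k = lsi_document_vectors(A, k=2)

raw_sim = cosine(A[:, 0], A[:, 1])            # bag-of-words similarity
lsi_sim = cosine(docs_k[0], docs_k[1])        # similarity in latent space
cross_sim = cosine(docs_k[0], docs_k[3])      # vehicle doc vs. flower doc

print(f"raw cosine(d0, d1)  = {raw_sim:.3f}")
print(f"LSI cosine(d0, d1)  = {lsi_sim:.3f}")
print(f"LSI cosine(d0, d3)  = {cross_sim:.3f}")
```

In this toy setting, documents 0 and 1 have zero similarity in the raw term space because they share no words, yet they land on the same latent dimension after the rank-2 reduction, while the flower documents stay on a separate one. Real collections are far noisier, and the number of retained dimensions would need to be chosen for the data at hand.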

Author information

Authors and Affiliations

School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, USA
Steven P. Crain, Ke Zhou, Shuang-Hong Yang & Hongyuan Zha

Corresponding author

Correspondence to Steven P. Crain.

Editor information

Editors and Affiliations

IBM Thomas J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA
Charu C. Aggarwal
University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
ChengXiang Zhai

About this chapter

Cite this chapter

Crain, S.P., Zhou, K., Yang, SH., Zha, H. (2012). Dimensionality Reduction and Topic Modeling: From Latent Semantic Indexing to Latent Dirichlet Allocation and Beyond. In: Aggarwal, C., Zhai, C. (eds) Mining Text Data. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-3223-4_5

DOI: https://doi.org/10.1007/978-1-4614-3223-4_5
Published: 07 January 2012
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4614-3222-7
Online ISBN: 978-1-4614-3223-4
eBook Packages: Computer Science, Computer Science (R0)
