Online Topic Modeling for Short Texts

Roy, Suman; Malladi, Vijay Varma; Sengupta, Ayan; Das, Souparna

doi:10.1007/978-3-030-65310-1_41

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12571)

Included in the following conference series:

International Conference on Service-Oriented Computing

Abstract

Knowledge retrieval from short texts has attracted considerable attention, since topic discovery can unearth hidden information in them. In many applications, such topics must be learned on the fly from streaming short texts. In this work we propose an online topic discovery algorithm (OTDA) for short texts. Short texts carry too little word co-occurrence information; OTDA compensates for this by adopting word-context semantic correlation obtained through the skip-gram view of the corpus, following the semantics-assisted NMF (SeaNMF) model of Shi et al. OTDA works with one data point or one chunk of data points at a time instead of keeping the entire data set in memory, and it is memoryless. We conduct experiments on a couple of public data sets and an internal data set, using one-pass and multi-pass iterations of the proposed algorithm. The results show encouraging performance of OTDA in terms of average Frobenius loss, Topic Coherence, Normalized Mutual Information (NMI), and emerging topic detection.
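
The word-context correlation mentioned in the abstract can be illustrated with a small sketch. It counts skip-gram (word, context) pairs inside a sliding window and converts the counts into a positive pointwise mutual information (PPMI) matrix. PPMI is one common choice for such a correlation matrix and is an assumption here; the abstract does not state the exact weighting used by OTDA. The function name `ppmi_matrix` and its `window` parameter are hypothetical.

```python
import numpy as np
from collections import Counter


def ppmi_matrix(docs, vocab, window=2):
    """Positive-PMI word-context matrix from skip-gram pairs (illustrative sketch).

    docs   : list of tokenised short texts (lists of strings)
    vocab  : list of words; row and column i both refer to vocab[i]
    window : number of neighbours on each side treated as context (assumed)
    """
    index = {w: i for i, w in enumerate(vocab)}
    pair_counts = Counter()
    for tokens in docs:
        ids = [index[t] for t in tokens if t in index]
        for pos, w in enumerate(ids):
            lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    pair_counts[(w, ids[ctx_pos])] += 1

    n = len(vocab)
    counts = np.zeros((n, n))
    for (w, c), k in pair_counts.items():
        counts[w, c] = k

    total = counts.sum()
    if total == 0:
        return counts  # no pairs observed, nothing to correlate

    joint = counts / total
    word_p = counts.sum(axis=1, keepdims=True) / total
    ctx_p = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / (word_p * ctx_p))
    pmi[~np.isfinite(pmi)] = 0.0   # unseen pairs contribute nothing
    return np.maximum(pmi, 0.0)    # keep only positive associations


docs = [["flu", "fever"], ["flu", "fever"], ["claim", "denied"]]
vocab = ["flu", "fever", "claim", "denied"]
S = ppmi_matrix(docs, vocab)
```

In a SeaNMF-style model, a matrix like `S` would be factorised jointly with the term-document matrix so that words sharing contexts are pulled toward the same topics.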

S. Das: This work was done while the author was an intern at Optum Global Solutions, Hyderabad, during May-June 2019.

Notes

1. We assume a low average number of PGD iterations for updating \(\mathbf{U}\) or \(\mathbf{V}\) in one round, and also a low average number of trials needed for implementing the Armijo rule [15, 21].
2. https://webscope.sandbox.yahoo.com/catalog.php?datatype=l.
3. Kaggle.com.
4. https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html.
5. https://radimrehurek.com/gensim/models/ldaseqmodel.html.
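
Note 1 refers to projected gradient descent (PGD) iterations and to trials of the Armijo rule. As a minimal sketch of what one such iteration can look like, the block below updates one factor of a plain NMF objective, 0.5 * ||X - UV||_F^2, with the other factor held fixed. It takes a projected gradient step and shrinks the step size until the Armijo sufficient-decrease condition holds. The objective omits any word-context or regularisation terms the chapter may use, and the function name and the defaults for `sigma`, `beta`, and `max_trials` are assumptions made for illustration.

```python
import numpy as np


def pgd_step_armijo(X, U, V, sigma=0.01, beta=0.1, alpha=1.0, max_trials=20):
    """One projected gradient step on V for 0.5 * ||X - U V||_F^2, with V >= 0.

    The step size alpha is multiplied by beta until the Armijo condition
    f(V_new) - f(V) <= sigma * <grad, V_new - V> is satisfied.
    """
    def loss(M):
        return 0.5 * np.linalg.norm(X - U @ M, "fro") ** 2

    grad = U.T @ (U @ V - X)          # gradient of the loss with respect to V
    f_old = loss(V)
    for _ in range(max_trials):       # each pass is one Armijo trial
        V_new = np.maximum(V - alpha * grad, 0.0)   # project onto V >= 0
        if loss(V_new) - f_old <= sigma * np.sum(grad * (V_new - V)):
            return V_new
        alpha *= beta
    return V                          # no acceptable step found, keep V


rng = np.random.default_rng(0)
X = rng.random((6, 8))
U = rng.random((6, 3))
V0 = rng.random((3, 8))
V1 = pgd_step_armijo(X, U, V0)
```

The number of passes through the loop is the "number of trials" that the note assumes to be low on average. Updating \(\mathbf{U}\) follows the same pattern applied to the transposed problem.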

References

1. AlSumait, L., Barbará, D., Domeniconi, C.: On-line LDA: adaptive topic models for mining text streams with applications to topic detection and tracking. In: Proceedings of ICDM 2008, pp. 3–12 (2008)
2. Blei, D.M., Lafferty, J.D.: Dynamic topic models. In: Proceedings of ICML 2006, pp. 113–120 (2006)
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
4. Bottou, L.: Stochastic learning. In: Advanced Lectures on Machine Learning, ML Summer Schools 2003, Canberra, Australia, Revised Lectures, pp. 146–168 (2003)
5. Bucak, S.S., Gunsel, B.: Incremental subspace learning via non-negative matrix factorization. Pattern Recogn. 42(5), 788–797 (2009)
6. Cao, B., Shen, D., Sun, J.T., Wang, X., Yang, Q., Chen, Z.: Detect and track latent factors with online nonnegative matrix factorization. In: Proceedings of IJCAI 2007, pp. 2689–2694 (2007)
7. Cheng, X., Guo, J., Liu, S., Wang, Y., Yan, X.: Learning topics in short texts by non-negative matrix factorization on term correlation matrix. In: Proceedings of the 13th SIAM International Conference on Data Mining 2013, pp. 749–757 (2013)
8. Guan, N., Tao, D., Luo, Z., Yuan, B.: Online nonnegative matrix factorization with robust stochastic approximation. IEEE Trans. Neural Netw. Learn. Syst. 23(7), 1087–1099 (2012)
9. Hoffman, M.D., Blei, D.M., Bach, F.R.: Online learning for latent Dirichlet allocation. In: Advances in Neural Information Processing Systems, vol. 23, pp. 856–864 (2010)
10. Hofmann, T.: Probabilistic latent semantic indexing. In: SIGIR 1999, pp. 50–57. ACM (1999)
11. Iwata, T., Yamada, T., Sakurai, Y., Ueda, N.: Online multiscale dynamic topic models. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 663–672 (2010)
12. Kim, J., He, Y., Park, H.: Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J. Global Optim. 58(2), 285–319 (2013). https://doi.org/10.1007/s10898-013-0035-4
13. Kuang, D., Choo, J., Park, H.: Nonnegative matrix factorization for interactive topic modeling and document clustering. In: Celebi, M.E. (ed.) Partitional Clustering Algorithms, pp. 215–243. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-09259-1_7
14. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.) Advances in Neural Information Processing Systems, vol. 13, pp. 556–562. MIT Press (2001)
15. Lin, C.J.: Projected gradient methods for nonnegative matrix factorization. Neural Comput. 19(10), 2756–2779 (2007)
16. Quan, X., Kit, C., Ge, Y., Pan, S.J.: Short and sparse text topic modeling via self-aggregation. In: Proceedings of IJCAI 2015, pp. 2270–2276. AAAI Press (2015)
17. Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: Proceedings of WSDM 2015, pp. 399–408. ACM (2015)
18. Sasaki, K., Yoshikawa, T., Furuhashi, T.: Online topic model for twitter considering dynamics of user interests and topic trends. In: Moschitti, A., Pang, B., Daelemans, W. (eds.) Proceedings of EMNLP 2014, pp. 1977–1985. ACL (2014)
19. Shi, T., Kang, K., Choo, J., Reddy, C.K.: Short-text topic modeling via non-negative matrix factorization enriched with local word-context correlations. In: Proceedings of WWW 2018, pp. 1105–1114 (2018)
20. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B 58, 267–288 (1996)
21. Wang, F., Tan, C., König, A.C., Li, P.: Efficient document clustering via online nonnegative matrix factorizations. In: Eleventh SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics (2011)
22. Wang, F., Tan, C., Li, P., König, A.C.: Efficient document clustering via online nonnegative matrix factorizations. In: Proceedings of the 11th SIAM International Conference on Data Mining (SDM), pp. 908–919 (2011)
23. Wang, X., McCallum, A.: Topics over time: a non-Markov continuous-time model of topical trends. In: Proceedings of the 12th KDD, pp. 424–433. ACM (2006)
24. Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix factorization. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2003. ACM (2003)
25. Zhong, S.: Efficient streaming text clustering. Neural Netw. 18(5–6), 790–798 (2005)
26. Zhou, G., Yang, Z., Xie, S., Yang, J.: Online blind source separation using incremental nonnegative matrix factorization with volume constraint. IEEE Trans. Neural Networks 22(4), 550–560 (2011)
27. Zuo, Y., Wu, J., Zhang, H., Lin, H., Wang, F., Xu, K., Xiong, H.: Topic modeling of short texts: a pseudo-document view. In: KDD 2016, pp. 2105–2114. ACM (2016)

Author information

Authors and Affiliations

Optum Global Solutions India Pvt. Ltd. (UnitedHealth Group), Bangalore, 560 103, India
Suman Roy, Vijay Varma Malladi & Ayan Sengupta
International Institute of Information Technology (IIIT-H), Hyderabad, Hyderabad, 500 032, India
Souparna Das

Corresponding author

Correspondence to Suman Roy.

Editor information

Editors and Affiliations

Zayed University, Dubai, United Arab Emirates
Eleanna Kafeza
University of New South Wales, Sydney, NSW, Australia
Boualem Benatallah
IIT National Research Council C.N.R., Pisa, Italy
Fabio Martinelli
Zayed University, Dubai, United Arab Emirates
Hakim Hacid
University of Sydney, Darlington, NSW, Australia
Athman Bouguettaya
Ernst & Young AI Lab, San Jose, CA, USA
Hamid Motahari

About this paper

Cite this paper

Roy, S., Malladi, V.V., Sengupta, A., Das, S. (2020). Online Topic Modeling for Short Texts. In: Kafeza, E., Benatallah, B., Martinelli, F., Hacid, H., Bouguettaya, A., Motahari, H. (eds) Service-Oriented Computing. ICSOC 2020. Lecture Notes in Computer Science, vol 12571. Springer, Cham. https://doi.org/10.1007/978-3-030-65310-1_41

DOI: https://doi.org/10.1007/978-3-030-65310-1_41
Published: 09 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65309-5
Online ISBN: 978-3-030-65310-1
eBook Packages: Computer Science, Computer Science (R0)
