Domain Dictionary-Based Topic Modeling for Social Text

Jiang, Bo; Liang, Jiguang; Sha, Ying; Li, Rui; Wang, Lihong

doi:10.1007/978-3-319-48740-3_8

Bo Jiang¹⁹,
Jiguang Liang¹⁹,
Ying Sha¹⁹,
Rui Li¹⁹ &
…
Lihong Wang¹⁹

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10041))

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

Online social networks are becoming increasingly popular and posting large volumes of unstructured social text documents every day. Inferring topics from large-scale social texts is a significant but challenging task for many text mining applications. Conventional topic models has been shown unsatisfactory results due to the sparsity and noise of content in short texts. Besides, the learned topics are very difficult to understand the semantic information only by the top weighted terms. In this paper, we propose a novel social text topic modeling method to deal with the problems. The proposed model utilizes topic domain dictionary to construct a weakly supervised matrix, which can play a role of making reference matrix and the learned topic matrix become similar. Experimental results on the constructed social text dataset from Twitter demonstrate that our proposed method can outperform the state-of-the art baselines significantly and also improve the semantic relevancy of the learned topic.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Sparse Topical Coding with Sparse Groups

PSLDA: a novel supervised pseudo document-based topic model for short texts

Article 27 May 2022

Intensity of Relationship Between Words: Using Word Triangles in Topic Discovery for Short Texts

Notes

References

Andrzejewski, D., Zhu, X., Craven, M.: Incorporating domain knowledge into topic modeling via dirichlet forest priors. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 25–32. ACM (2009)
Google Scholar
Balasubramanyan, R., Cohen, W.W.: Regularization of latent variable models to obtain sparsity. In: SDM, pp. 414–422. SIAM (2013)
Google Scholar
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
MATH Google Scholar
Boyd-Graber, J.L., Blei, D.M.: Syntactic topic models. In: Advances in Neural Information Processing Systems, pp. 185–192 (2009)
Google Scholar
Basave, A.E.C. He, Y., Xu, R.: Automatic labelling of topic models learned from twitter by summarisation. Association for Computational Linguistics (ACL) (2014)
Google Scholar
Cheng, X., Yan, X., Lan, Y., Guo, J.: Btm: topic modeling over short texts. IEEE Trans. Knowl. Data Eng. 26(12), 2928–2941 (2014)
Article Google Scholar
Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. JAsIs 41(6), 391–407 (1990)
Article Google Scholar
Dredze, M., Wallach, H.M., Puller, D., Pereira, F.: Generating summary keywords for emails using topics. In: Proceedings of the 13th International Conference on Intelligent User Interfaces, pp. 199–206. ACM (2008)
Google Scholar
Eisenstein, J., Ahmed, A., Xing, E.P.: Sparse additive generative models of text (2011)
Google Scholar
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)
Article Google Scholar
Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57. ACM (1999)
Google Scholar
Yuening, H., Boyd-Graber, J., Satinoff, B., Smith, A.: Interactive topic modeling. Mach. Learn. 95(3), 423–469 (2014)
Article MathSciNet Google Scholar
Hulpus, I., Hayes, C., Karnstedt, M., Greene, D.: Unsupervised graph-based topic labelling using dbpedia. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pp. 465–474. ACM (2013)
Google Scholar
Jagarlamudi, J., Daumé III, H., Udupa, R.: Incorporating lexical priors into topic models. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 204–213. Association for Computational Linguistics (2012)
Google Scholar
Jin, O., Liu, N.N., Zhao, K., Yu, Y., Yang, Q.: Transferring topical knowledge from auxiliary long texts for short text clustering. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 775–784. ACM (2011)
Google Scholar
Lau, J.H., Grieser, K., Newman, D., Baldwin, T.: Automatic labelling of topic models. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 1536–1545. Association for Computational Linguistics (2011)
Google Scholar
Lau, J.H., Newman, D., Baldwin, T.: Machine reading tea leaves: automatically evaluating topic coherence and topic model quality. In: Proceedings of the Association for Computational Linguistics, pp. 530–539 (2014)
Google Scholar
Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
Article Google Scholar
Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems, pp. 556–562 (2001)
Google Scholar
Mei, Q., Shen, X., Zhai, C.: Automatic labeling of multinomial topic models. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 490–499. ACM (2007)
Google Scholar
Paul, M.J., Dredze, M.: You are what you tweet: analyzing twitter for public health. In: ICWSM, pp. 265–272 (2011)
Google Scholar
Phan, X.-H., Nguyen, L.-M., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th International Conference on World Wide Web, pp. 91–100. ACM (2008)
Google Scholar
Quercia, D., Askham, H., Crowcroft, J.: Tweetlda: supervised topic classification and link prediction in twitter. In: Proceedings of the 4th Annual ACM Web Science Conference, pp. 247–250. ACM (2012)
Google Scholar
Ramage, D., Dumais, S.T., Liebling, D.J.: Characterizing microblogs with topic models. In: ICWSM, vol. 10, p. 1 (2010)
Google Scholar
Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 399–408. ACM (2015)
Google Scholar
Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., Demirbas, M.: Short text classification in twitter to improve information filtering. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 841–842. ACM (2010)
Google Scholar
Wang, C., Blei, D.M.: Decoupling sparsity and smoothness in the discrete hierarchicaldirichlet process. In: Advances in Neural Information Processing Systems, pp. 1982–1989 (2009)
Google Scholar
Wang, D., Li, T., Zhu, S., Ding, C.: Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 307–314. ACM (2008)
Google Scholar
Wang, Q., Jun, X., Li, H., Craswell, N.: Regularized latent semantic indexing: a new approach to large-scale topic modeling. ACM Trans. Inf. Syst. (TOIS) 31(1), 5 (2013)
Article Google Scholar
Wang, X., McCallum, A.: Topics over time: a non-markov continuous-time model of topicaltrends. In: Proceedings of the 12th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining, pp. 424–433. ACM (2006)
Google Scholar
Williamson, S., Wang, C., Heller, K.A., Blei, D.M.: The ibp compound dirichlet process and its application to focused topic modeling. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 1151–1158 (2010)
Google Scholar
Wu, Y., Wu, W., Li, Z., Zhou, M.: Mining query subtopics from questions in community question answering. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
Google Scholar
Yan, X., Guo, J., Liu, S., Cheng, X., Wang, Y.: Learning topics in short texts by non-negative matrix factorization on term correlation matrix. In: Proceedings of the SIAM International Conference on Data Mining (2013)
Google Scholar
Yang, S.-H., Kolcz, A., Schlaikjer, A., Gupta, P.: Large-scale high-precision topic modeling on twitter. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1907–1916. ACM (2014)
Google Scholar
Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., Li, X.: Comparing twitter and traditional media using topic models. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 338–349. Springer, Heidelberg (2011). doi:10.1007/978-3-642-20161-5_34
Chapter Google Scholar
Zhu, S., Yu, K., Chi, Y., Gong, Y.: Combining content and link for classification using matrix factorization. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 487–494. ACM (2007)
Google Scholar

Download references

Acknowledgments

This work was supported by National Key Technology R&D Program(No. 2012BAH46B03), and the Strategic Leading Science and Technology Projects of Chinese Academy of Sciences(No. XDA06030200).

Author information

Authors and Affiliations

National Engineering Laboratory for Information Security Technologies, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, 100093, China
Bo Jiang, Jiguang Liang, Ying Sha, Rui Li & Lihong Wang

Authors

Bo Jiang
View author publications
You can also search for this author in PubMed Google Scholar
Jiguang Liang
View author publications
You can also search for this author in PubMed Google Scholar
Ying Sha
View author publications
You can also search for this author in PubMed Google Scholar
Rui Li
View author publications
You can also search for this author in PubMed Google Scholar
Lihong Wang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ying Sha .

Editor information

Editors and Affiliations

Poznań University of Economics, Poznan, Poland
Wojciech Cellary
University of Minnesota, Minneapolis, Minnesota, USA
Mohamed F. Mokbel
Tsinghua University, Beijing, China
Jianmin Wang
Victoria University, Melbourne, Victoria, Australia
Hua Wang
Victoria University, Melbourne, Victoria, Australia
Rui Zhou
Victoria University, Melbourne, Victoria, Australia
Yanchun Zhang

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Jiang, B., Liang, J., Sha, Y., Li, R., Wang, L. (2016). Domain Dictionary-Based Topic Modeling for Social Text. In: Cellary, W., Mokbel, M., Wang, J., Wang, H., Zhou, R., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2016. WISE 2016. Lecture Notes in Computer Science(), vol 10041. Springer, Cham. https://doi.org/10.1007/978-3-319-48740-3_8

Download citation

DOI: https://doi.org/10.1007/978-3-319-48740-3_8
Published: 02 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48739-7
Online ISBN: 978-3-319-48740-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics