Domain Dictionary-Based Topic Modeling for Social Text

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10041)

Abstract

Online social networks are becoming increasingly popular, and their users post large volumes of unstructured social text every day. Inferring topics from large-scale social text is an important but challenging task for many text mining applications. Conventional topic models have been shown to produce unsatisfactory results because content in short texts is sparse and noisy. Moreover, the semantics of the learned topics are difficult to interpret from the top-weighted terms alone. In this paper, we propose a novel topic modeling method for social text that addresses these problems. The proposed model uses a topic domain dictionary to construct a weakly supervised matrix, which serves to make the reference matrix and the learned topic matrix similar. Experimental results on a social text dataset constructed from Twitter demonstrate that the proposed method significantly outperforms state-of-the-art baselines and also improves the semantic relevance of the learned topics.
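The abstract does not state the model's objective, so the following is only a minimal sketch of one way a dictionary-derived matrix could pull learned topics toward a reference. It assumes a non-negative matrix factorization X ≈ WH with an added Frobenius penalty lam * ||H - R||², where R is a binary topic-by-term matrix built directly from the dictionary. The function names (`build_reference`, `guided_nmf`), the parameter `lam`, the multiplicative update rule, and the toy data are all illustrative assumptions, not the authors' formulation.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def build_reference(vocab, domain_dict, topics):
    """Assumed construction: R[k][j] = 1.0 if term j is listed under topic k."""
    return [[1.0 if term in domain_dict[t] else 0.0 for term in vocab]
            for t in topics]

def loss(X, W, H, R, lam):
    """Reconstruction error plus the penalty tying H to the reference R."""
    WH = matmul(W, H)
    fit = sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))
    reg = sum((h - r) ** 2 for rh, rr in zip(H, R) for h, r in zip(rh, rr))
    return fit + lam * reg

def guided_nmf(X, R, lam=1.0, iters=200, seed=0, eps=1e-9):
    """Multiplicative updates for min ||X - WH||^2 + lam * ||H - R||^2, W, H >= 0."""
    rng = random.Random(seed)
    n, k, m = len(X), len(R), len(R[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    history = [loss(X, W, H, R, lam)]
    for _ in range(iters):
        # W <- W * (X H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
        # H <- H * (W^T X + lam R) / (W^T W H + lam H)
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * (a + lam * r) / (b + lam * h + eps)
              for h, a, b, r in zip(rh, ra, rb, rr)]
             for rh, ra, rb, rr in zip(H, num, den, R)]
        history.append(loss(X, W, H, R, lam))
    return W, H, history

# Toy example (invented data): four short documents over a four-term vocabulary.
vocab = ["goal", "match", "vote", "election"]
topics = ["sports", "politics"]
domain_dict = {"sports": {"goal", "match"}, "politics": {"vote", "election"}}
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]

R = build_reference(vocab, domain_dict, topics)
W, H, history = guided_nmf(X, R, lam=1.0)
for name, row in zip(topics, H):
    print(name, [round(v, 2) for v in row])
```

In this sketch, a larger `lam` keeps each row of H closer to its dictionary row, which also fixes which learned topic corresponds to which dictionary domain; `lam = 0` reduces to plain NMF.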

Notes

  1. http://www.ldoceonline.com/

  2. http://www.csie.ntu.edu.tw/cjlin/liblinear/

Acknowledgments

This work was supported by the National Key Technology R&D Program (No. 2012BAH46B03) and the Strategic Leading Science and Technology Projects of the Chinese Academy of Sciences (No. XDA06030200).

Author information

Corresponding author

Correspondence to Ying Sha.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Jiang, B., Liang, J., Sha, Y., Li, R., Wang, L. (2016). Domain Dictionary-Based Topic Modeling for Social Text. In: Cellary, W., Mokbel, M., Wang, J., Wang, H., Zhou, R., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2016. WISE 2016. Lecture Notes in Computer Science, vol 10041. Springer, Cham. https://doi.org/10.1007/978-3-319-48740-3_8

  • DOI: https://doi.org/10.1007/978-3-319-48740-3_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-48739-7

  • Online ISBN: 978-3-319-48740-3

  • eBook Packages: Computer Science, Computer Science (R0)
