Statistical word sense aware topic models

Tang, Guoyu; Xia, Yunqing; Sun, Jun; Zhang, Min; Zheng, Thomas Fang

doi:10.1007/s00500-014-1372-z

Published: 19 July 2014

Soft Computing, Volume 19, pages 13–27 (2015)

Guoyu Tang¹,
Yunqing Xia¹,
Jun Sun²,
Min Zhang³ &
…
Thomas Fang Zheng¹


Abstract

Latent Dirichlet allocation (LDA) has proven effective in modeling the semantic relations between surface words. This semantic information in a document collection is useful for estimating the topic distribution of a document. In general, a surface word may contribute significantly to several topics in a document collection. LDA measures the contribution of a surface word to each topic and treats a surface word as identical across all documents. However, a surface word may present different signatures in different contexts, i.e., polysemous words can be used with different senses in different contexts. Intuitively, disambiguating word senses for topic models can enhance their discriminative capabilities. In this work, we propose a joint model that induces document topics and word senses simultaneously. Instead of relying on pre-defined word sense resources, we capture word sense information via a latent variable and induce it directly from the corpora in a fully unsupervised manner. Experimental results show that the proposed joint model significantly outperforms the baselines in document clustering and also improves word sense induction over a standalone non-parametric model.


Notes

In this paper, Dir denotes the Dirichlet distribution and DP denotes the Dirichlet process.
\(p(z|w)\) can be calculated as \( p(z|w) \propto p(w|z) \sum_{d} p(z|d)\,p(d) \), where \(p(w|z)\) and \(p(z|d)\) are model parameters that can be estimated, and \(p(d)\) is estimated as the ratio of document \(d\)'s length to the total length of the document collection.
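
The calculation of \(p(z|w)\) in the note above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `topic_posterior_given_word` and its argument layout are assumptions made for this example, and the inputs stand in for parameters that a fitted topic model would supply.

```python
def topic_posterior_given_word(p_w_given_z, p_z_given_d, doc_lengths):
    """Return p(z|w) for a single word w as a list over topics.

    p_w_given_z: p(w|z) for each topic z (hypothetical model output).
    p_z_given_d: one list per document holding p(z|d) for each topic z.
    doc_lengths: token count of each document, used to estimate p(d).
    """
    total_length = sum(doc_lengths)
    # p(d) is the document's share of the whole collection's length.
    p_d = [n / total_length for n in doc_lengths]

    num_topics = len(p_w_given_z)
    num_docs = len(doc_lengths)

    # Sum over documents: sum_d p(z|d) p(d).
    p_z = [
        sum(p_z_given_d[d][z] * p_d[d] for d in range(num_docs))
        for z in range(num_topics)
    ]

    # Unnormalized posterior p(w|z) * sum_d p(z|d) p(d), then normalize.
    unnormalized = [p_w_given_z[z] * p_z[z] for z in range(num_topics)]
    normalizer = sum(unnormalized)
    if normalizer == 0:
        raise ValueError("word has zero probability under every topic")
    return [u / normalizer for u in unnormalized]


# Toy example: two topics, two documents of 100 and 300 tokens.
posterior = topic_posterior_given_word(
    p_w_given_z=[0.2, 0.1],
    p_z_given_d=[[0.5, 0.5], [0.5, 0.5]],
    doc_lengths=[100, 300],
)
print(posterior)
```

Because the proportionality constant is unknown, the sketch normalizes over topics so that the returned values sum to one. In the toy example both topics are equally likely overall, so the posterior is driven entirely by \(p(w|z)\) and the first topic receives about two thirds of the mass.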


Acknowledgments

We thank the reviewers for their insightful comments. This work is supported by the Natural Science Foundation of China (NSFC: 61272233).

Author information

Authors and Affiliations

Department of Computer Science and Technology, TNList, Tsinghua University, Beijing, China
Guoyu Tang, Yunqing Xia & Thomas Fang Zheng
Institute for Infocomm Research, A-STAR, Singapore, Singapore
Jun Sun
Soochow University, Suzhou, China
Min Zhang


Corresponding author

Correspondence to Yunqing Xia.

Additional information

Communicated by L. Xie.


About this article

Cite this article

Tang, G., Xia, Y., Sun, J. et al. Statistical word sense aware topic models. Soft Comput 19, 13–27 (2015). https://doi.org/10.1007/s00500-014-1372-z

Issue Date: January 2015
DOI: https://doi.org/10.1007/s00500-014-1372-z
