Abstract
Latent Dirichlet allocation (LDA) has proven effective in modeling the semantic relations between surface words. This semantic information in a document collection is useful for estimating the topic distribution of a document. In general, a surface word may contribute significantly to several topics in a document collection. LDA measures the contribution of a surface word to each topic and treats a surface word as identical across all documents. However, a surface word may carry different signatures in different contexts; that is, polysemous words can be used with different senses in different contexts. Intuitively, disambiguating word senses for topic models can enhance their discriminative capabilities. In this work, we propose a joint model that induces document topics and word senses simultaneously. Instead of relying on pre-defined word sense resources, we capture word sense information via a latent variable and induce it directly from the corpora in a fully unsupervised manner. Experimental results show that the proposed joint model significantly outperforms the baselines in document clustering and also improves word sense induction over a standalone non-parametric model.
Notes
Note that in this paper, we use Dir to denote the Dirichlet distribution and DP to denote the Dirichlet process.
\(p(z|w)\) can be calculated as \( p(z|w) \propto p(w|z) \sum_{d} p(z|d)\,p(d) \), where \(p(w|z)\) and \(p(z|d)\) are model parameters that can be estimated, and \(p(d)\) is estimated as the ratio of the length of document \(d\) to the total length of the document collection.
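The calculation in this note can be illustrated with a minimal sketch. It assumes the estimated parameters are already available as plain Python structures; the function name `topic_given_word`, the argument layout, and the toy numbers are all hypothetical and are not taken from the paper.

```python
def topic_given_word(p_w_given_z, p_z_given_d, doc_lengths, word):
    """Return p(z|w) for one word as a list indexed by topic.

    p_w_given_z: list of dicts, one per topic, mapping word -> p(w|z)
    p_z_given_d: list of lists, one per document, giving p(z|d) per topic
    doc_lengths: list of document lengths (token counts)
    """
    total_length = sum(doc_lengths)
    # p(d): share of the collection's total length taken up by document d
    p_d = [n / total_length for n in doc_lengths]

    num_topics = len(p_w_given_z)
    # Topic prior p(z) = sum over d of p(z|d) * p(d)
    p_z = [
        sum(p_z_given_d[d][z] * p_d[d] for d in range(len(p_d)))
        for z in range(num_topics)
    ]

    # Unnormalized posterior p(w|z) * p(z); a word unseen in a topic gets 0
    unnormalized = [p_w_given_z[z].get(word, 0.0) * p_z[z] for z in range(num_topics)]
    norm = sum(unnormalized)
    if norm == 0.0:
        raise ValueError(f"word {word!r} has zero probability under every topic")
    return [u / norm for u in unnormalized]


# Toy parameters (illustrative values only): two topics, two documents.
p_w_given_z = [
    {"bank": 0.2, "river": 0.8},
    {"bank": 0.6, "money": 0.4},
]
p_z_given_d = [
    [0.9, 0.1],
    [0.3, 0.7],
]
doc_lengths = [100, 300]

posterior = topic_given_word(p_w_given_z, p_z_given_d, doc_lengths, "bank")
```

Normalizing at the end turns the proportionality into an equality, so the returned values sum to one across topics.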
Acknowledgments
We thank the reviewers for their insightful comments. This work is supported by the Natural Science Foundation of China (NSFC: 61272233).
Additional information
Communicated by L. Xie.
About this article
Cite this article
Tang, G., Xia, Y., Sun, J. et al. Statistical word sense aware topic models. Soft Comput 19, 13–27 (2015). https://doi.org/10.1007/s00500-014-1372-z
DOI: https://doi.org/10.1007/s00500-014-1372-z