Abstract
Cross-lingual topic analysis aims to extract latent topics from corpora in different languages. Early approaches rely on high-cost multilingual resources (e.g., a parallel corpus), which are hard to come by in many real cases. Some works require only a translation dictionary as the linkage between languages; however, given an inappropriate dictionary (e.g., one with small coverage), the cross-lingual topic model shrinks to a monolingual topic model and generates less diversified topics. It is therefore imperative to investigate cross-lingual topic models that require fewer bilingual resources. Recently, space-mapping techniques have been proposed to align the word embeddings of different languages into a quality cross-lingual word embedding by referring to a small number of translation pairs. This work proposes a cross-lingual topic model, called Cb-CLTM, which incorporates cross-lingual word embeddings. To leverage the word semantics and the cross-language linkage captured by the cross-lingual word embedding, Cb-CLTM represents each word as a continuous embedding vector rather than a discrete word type. Our experiments demonstrate that, when the cross-lingual word space exhibits strong isomorphism, Cb-CLTM generates more coherent topics with higher diversity and induces better document representations across languages for downstream tasks such as cross-lingual document clustering and classification. When the cross-lingual word space is less isomorphic, Cb-CLTM generates less coherent topics yet still prevails in topic diversity and document classification.
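To make the space-mapping idea concrete, the following is a minimal sketch of one widely used alignment technique, orthogonal Procrustes over a small seed dictionary. It illustrates the general approach rather than the specific alignment method evaluated in this work; the function name and variables are ours, not the paper's.

```python
import numpy as np

def procrustes_map(X_src, Y_tgt):
    """Learn an orthogonal map W minimizing ||X_src @ W - Y_tgt||_F.

    X_src, Y_tgt: (n, d) arrays holding the source- and target-language
    embeddings of n seed translation pairs (rows aligned).
    The closed-form solution is W = U @ Vt, where U, S, Vt is the SVD
    of X_src.T @ Y_tgt (orthogonal Procrustes).
    """
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

# Mapping every source-language vector v to v @ W places both languages
# in a shared space, where nearest neighbors act as candidate translations.
```

Because W is constrained to be orthogonal, the map preserves distances within the source space; alignment quality then hinges on how isomorphic the two monolingual embedding spaces are, which is precisely the condition varied in the experiments.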
Notes
\(p(z_{d_i}=t | \mathbf {z}_{\lnot {d_i}}, \mathbf {w}) = \frac{p(\mathbf {z}, \mathbf {w})}{p(\mathbf {z}_{\lnot {d_i}}, \mathbf {w})} = \frac{p(\mathbf {z}, \mathbf {w})}{p(\mathbf {z}_{\lnot {d_i}}, \mathbf {w}_{\lnot {d_i}})\,p(w_{d_i})}\) and \(p(z_{d_i}=t, w_{d_i} | \mathbf {z}_{\lnot {d_i}}, \mathbf {w}_{\lnot {d_i}} ) = \frac{p(\mathbf {z}, \mathbf {w})}{p(\mathbf {z}_{\lnot {d_i}}, \mathbf {w}_{\lnot {d_i}})}\). Since \(p(w_{d_i})\) does not depend on the topic \(t\), the two quantities are proportional, so sampling from the full conditional amounts to evaluating the joint ratio on the right.
L-BFGS is the abbreviation of the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm.
Note that the Cb-CLTM does not have this parameter.
Acknowledgements
This work was supported in part by the Ministry of Science and Technology, Taiwan, under grant no. MOST 106-2410-H-110-017-MY3.
Ethics declarations
Conflict of interest
None.
A Collapsed Gibbs Sampler for Topic Assignment
Notice that we omit \(\alpha , \theta , \psi , H^{cs}, l_d\) from the distribution \(p(z_{d_i}=t | \mathbf {z}_{\lnot {d_i}}, \mathbf {w}; \alpha , \theta , \psi , H^{cs}, l_d)\) and instead write \(p(z_{d_i}=t | \mathbf {z}_{\lnot {d_i}}, \mathbf {w})\) for brevity, where \(\mathbf {w}\) denotes the words of the document.
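To illustrate how such a collapsed sampler is typically implemented, here is a minimal, hypothetical Python sketch of one Gibbs update for \(z_{d_i}\). It assumes a von Mises-Fisher topic-word component over unit-norm word embeddings, consistent with treating words as continuous vectors, and for brevity treats the per-topic mean directions and concentration as fixed within a sweep rather than recomputed from the remaining assignments; all variable names are illustrative, not the paper's.

```python
import numpy as np

def gibbs_update(d, i, w_vec, z, ndk, mu, kappa, alpha, rng):
    """Resample the topic of word i in document d (one collapsed update).

    w_vec : (dim,) unit-norm embedding of word i in document d
    z     : list of lists of current topic assignments
    ndk   : (D, T) document-topic count matrix
    mu    : (T, dim) unit-norm topic mean directions
    kappa : scalar vMF concentration (shared across topics here)
    alpha : symmetric Dirichlet hyperparameter
    """
    t_old = z[d][i]
    ndk[d, t_old] -= 1                       # exclude the current assignment

    log_prior = np.log(ndk[d] + alpha)       # document-topic count term
    log_lik = kappa * (mu @ w_vec)           # unnormalized vMF log-density
    logp = log_prior + log_lik
    p = np.exp(logp - logp.max())            # stabilize before normalizing

    t_new = rng.choice(len(p), p=p / p.sum())
    ndk[d, t_new] += 1
    z[d][i] = t_new
```

A full sweep applies this update to every word position; after burn-in, the counts in ndk yield the document-topic distributions used for downstream clustering and classification.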
About this article
Cite this article
Chang, CH., Hwang, SY. A word embedding-based approach to cross-lingual topic modeling. Knowl Inf Syst 63, 1529–1555 (2021). https://doi.org/10.1007/s10115-021-01555-7