
A word embedding-based approach to cross-lingual topic modeling

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

Cross-lingual topic analysis aims to extract latent topics from corpora in different languages. Early approaches rely on costly multilingual resources (e.g., a parallel corpus), which are hard to come by in many real cases. Some works require only a translation dictionary as the linkage between languages; however, given an inadequate dictionary (e.g., one with small coverage), the cross-lingual topic model degenerates into a monolingual topic model and generates less diversified topics. It is therefore imperative to investigate cross-lingual topic models that require fewer bilingual resources. Recently, space-mapping techniques have been proposed that align the word embeddings of different languages into a high-quality cross-lingual word embedding by referring to a small number of translation pairs. This work proposes a cross-lingual topic model, called Cb-CLTM, which incorporates cross-lingual word embeddings. To leverage the word semantics and the cross-language linkage encoded in the cross-lingual word embedding, Cb-CLTM represents each word as a continuous embedding vector rather than a discrete word type. The experiments demonstrate that, when the cross-lingual word space exhibits strong isomorphism, Cb-CLTM generates more coherent topics with higher diversity and induces better representations of documents across languages for further tasks such as cross-lingual document clustering and classification. When the cross-lingual word space is less isomorphic, Cb-CLTM generates less coherent topics, yet still prevails in topic diversity and document classification.
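The space-mapping step referred to above is commonly realized as an orthogonal (Procrustes) mapping estimated from a small seed dictionary of translation pairs; this is, for instance, the supervised mode of the MUSE toolkit cited in the notes. The sketch below illustrates that general technique, not the specific mapping procedure of the article; X_src and Y_tgt are hypothetical names for the row-aligned embedding matrices of the seed pairs.

```python
import numpy as np

def procrustes_align(X_src, Y_tgt):
    """Orthogonal map W minimizing ||X_src @ W - Y_tgt||_F.

    Closed-form Procrustes solution: W = U V^T, where
    U S V^T is the SVD of X_src^T @ Y_tgt.
    """
    u, _, vt = np.linalg.svd(X_src.T @ Y_tgt)
    return u @ vt

# Illustrative usage with random stand-ins for real embeddings:
# row k of X_src is a source-language word vector and row k of
# Y_tgt the vector of its translation (500 seed pairs, 300 dims).
rng = np.random.default_rng(0)
X_src = rng.standard_normal((500, 300))
Y_tgt = rng.standard_normal((500, 300))
W = procrustes_align(X_src, Y_tgt)
mapped = X_src @ W  # source vectors expressed in the target space
```

Because W is orthogonal, the mapping preserves distances and angles within the source space, which is why the degree of isomorphism between the two monolingual spaces (stressed in the experiments above) bounds the quality of the resulting cross-lingual embedding.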


Notes

  1. http://www.statmt.org/europarl/.

  2. https://github.com/facebookresearch/MUSE.

  3. https://github.com/google-research/bert/blob/master/multilingual.md.

  4. \(p(z_{d_i}=t \mid \mathbf {z}_{\lnot {d_i}}, \mathbf {w}) = \frac{p(\mathbf {z}, \mathbf {w})}{p(\mathbf {z}_{\lnot {d_i}}, \mathbf {w})} = \frac{p(\mathbf {z}, \mathbf {w})}{p(\mathbf {z}_{\lnot {d_i}}, \mathbf {w}_{\lnot {d_i}})\,p(w_{d_i})}\), whereas \(p(z_{d_i}=t, w_{d_i} \mid \mathbf {z}_{\lnot {d_i}}, \mathbf {w}_{\lnot {d_i}}) = \frac{p(\mathbf {z}, \mathbf {w})}{p(\mathbf {z}_{\lnot {d_i}}, \mathbf {w}_{\lnot {d_i}})}\); the two expressions differ only by the factor \(p(w_{d_i})\), which does not depend on \(t\), hence the proportionality used in the appendix.

  5. L-BFGS is the abbreviation of the Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm.

  6. Note that the Cb-CLTM does not have this parameter.


Acknowledgements

This work has been supported in part by the Ministry of Science and Technology, Taiwan, under grant no. MOST 106-2410-H-110-017-MY3.

Author information

Corresponding author

Correspondence to Chia-Hsuan Chang.

Ethics declarations

Conflict of interest

None.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Collapsed Gibbs Sampler for Topic Assignment

Notice that we omit \(\alpha , \theta , \psi , H^{cs}, l_d\) from the distribution \(p(z_{d_i}=t \mid \mathbf {z}_{\lnot {d_i}}, \mathbf {w}; \alpha , \theta , \psi , H^{cs}, l_d)\) and write \(p(z_{d_i}=t \mid \mathbf {z}_{\lnot {d_i}}, \mathbf {w})\) for brevity, where \(\mathbf {w}\) contains the words of a document. The word likelihood factorizes out of the integral because, given its topic assignment \(z_{d_i}=t\), the word \(w_{d_i}\) depends only on the topic parameters \(\psi _t\):

$$\begin{aligned} p(z_{d_i}&=t \mid \mathbf {z}_{\lnot {d_i}}, \mathbf {w}) \propto p(z_{d_i}=t, w_{d_i} \mid \mathbf {z}_{\lnot {d_i}}, \mathbf {w}_{\lnot {d_i}} )\\&= \int {p(z_{d_i}=t, w_{d_i}, \theta _{d} \mid \mathbf {z}_{\lnot {d_i}}, \mathbf {w}_{\lnot {d_i}})\,\text {d}\theta _{d}}\\&= \underbrace{\int {p(z_{d_i}=t \mid \theta _{d} )\,p(\theta _{d} \mid \mathbf {z}_{\lnot {d_i}}, \mathbf {w}_{\lnot {d_i}})\,\text {d}\theta _{d}}}_{E(\theta _{d, t})\,of\,Dirichlet} \cdot \, p(w_{d_i} \mid z_{d_i}=t)\\&= \frac{N^{t}_{d,\lnot {d_i}} + \alpha _{t}}{\sum _{t'=1}^{T}\left( N^{t'}_{d,\lnot {d_i}}+\alpha _{t'}\right) } \cdot \phi _t(w_{d_{i}} \mid \psi _{t};H^{cs}_{l_{d}}) \end{aligned}$$
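To make the sampling step concrete, the following is a minimal Python sketch of the final line above. It is not the authors' implementation: the per-topic word likelihood \(\phi _t\) is assumed to be a von Mises–Fisher (vMF) density over unit-length word embeddings (a common choice in embedding-based topic models), and log_C_vmf, word_vecs, n_dt, mu and kappa are hypothetical names introduced for illustration. The denominator of the Dirichlet expectation is constant in \(t\) and can therefore be dropped before normalization.

```python
import numpy as np

def log_vmf_density(w_vec, mu_t, kappa_t, log_C_vmf):
    """Log vMF density of a unit word vector under topic t:
    log C(kappa_t) + kappa_t * <mu_t, w_vec>."""
    return log_C_vmf(kappa_t, w_vec.shape[0]) + kappa_t * mu_t.dot(w_vec)

def sample_topic(d, i, z, n_dt, alpha, word_vecs, mu, kappa, log_C_vmf, rng):
    """Resample the topic of the i-th word of document d (one Gibbs step)."""
    T = mu.shape[0]
    n_dt[d, z[d][i]] -= 1                 # exclude the current assignment

    w_vec = word_vecs[d][i]               # embedding of word w_{d_i}
    log_p = np.empty(T)
    for t in range(T):
        # Dirichlet expectation (numerator only; the denominator is
        # the same for every t) times the vMF word likelihood.
        log_p[t] = np.log(n_dt[d, t] + alpha[t]) + \
                   log_vmf_density(w_vec, mu[t], kappa[t], log_C_vmf)

    p = np.exp(log_p - log_p.max())       # stabilize, then normalize
    new_t = rng.choice(T, p=p / p.sum())

    z[d][i] = new_t
    n_dt[d, new_t] += 1                   # record the new assignment
    return new_t
```

In a full sampler, this step would be interleaved with re-estimating each topic's vMF parameters \(\psi _t = (\mu _t, \kappa _t)\) from the embeddings currently assigned to it.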


Cite this article

Chang, CH., Hwang, SY. A word embedding-based approach to cross-lingual topic modeling. Knowl Inf Syst 63, 1529–1555 (2021). https://doi.org/10.1007/s10115-021-01555-7

