Abstract
Existing hierarchical clustering approaches struggle to represent semantic graphs and heterogeneous multilingual data correctly, which limits their ability to extract insights from complex datasets. The difficulty grows when hierarchical neural topic-seeding models must be integrated with semantic graphs for multilingual clustering, since this requires a careful combination of heterogeneous data and structural knowledge. In this context, we propose the Multilingual Hybrid LDA (MultiHLDA), a novel five-phase generative cross-domain approach that integrates prior domain knowledge into hierarchical topic modeling. It performs multilingual distributional term clustering over the Fundamental Concepts (FC) that compose upper semantic graphs, in order to semi-automatically learn and enhance a Bottom-Up Universal Upper Semantic Graph (BU3SG) from heterogeneous data. MultiHLDA includes model fine-tuning and uses FC as seed terms to cluster semantically related terms into concepts. We aim to create high-quality, non-overlapping clusters by aligning term clusters with established FC and exploiting noun phrase patterns. Our approach combines data with semantic nuances using techniques from both graph mining and machine learning. The empirical findings show its effectiveness in constructing and enhancing the BU3SG and its capacity to improve hierarchical topic-seed discovery, clustering, and document/graph representation. These results were obtained with models trained on ontology, fisheries, and medicine datasets.
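To make the idea of using FC as seed terms more concrete, the following is a minimal sketch of a seeded topic-word prior followed by a hard, non-overlapping term assignment. It is not the authors' implementation: the function names, the `base` and `boost` values, and the toy vocabulary and concept labels are all hypothetical, and a real pipeline would pass such a prior to an LDA sampler rather than assign terms from the prior directly.

```python
from typing import Dict, List


def build_seeded_prior(
    vocab: List[str],
    seeds: Dict[str, List[str]],
    base: float = 0.01,   # hypothetical symmetric prior weight
    boost: float = 1.0,   # hypothetical extra weight for seed terms
) -> Dict[str, List[float]]:
    """Build one prior row per concept; seed terms get base + boost."""
    index = {term: i for i, term in enumerate(vocab)}
    prior: Dict[str, List[float]] = {}
    for concept, terms in seeds.items():
        row = [base] * len(vocab)
        for term in terms:
            if term in index:  # seed terms missing from the vocabulary are skipped
                row[index[term]] = base + boost
        prior[concept] = row
    return prior


def hard_assign(
    vocab: List[str], topic_term: Dict[str, List[float]]
) -> Dict[str, List[str]]:
    """Assign each term to the single concept with the highest weight.

    Ties go to the first concept in dictionary order, so every term lands
    in exactly one cluster and the clusters do not overlap.
    """
    clusters: Dict[str, List[str]] = {concept: [] for concept in topic_term}
    for i, term in enumerate(vocab):
        best = max(topic_term, key=lambda concept: topic_term[concept][i])
        clusters[best].append(term)
    return clusters


if __name__ == "__main__":
    # Toy data, assumed purely for illustration.
    vocab = ["fish", "vessel", "disease", "drug"]
    seeds = {"Organism": ["fish"], "Treatment": ["drug"]}
    prior = build_seeded_prior(vocab, seeds)
    print(hard_assign(vocab, prior))
```

In this sketch the prior only biases each concept toward its own seed terms; the learned topic-term weights from a fitted model would replace `prior` in the call to `hard_assign`.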









Data availability statement
Not applicable. The confidentiality and privacy of participant data were ensured, and data were anonymized and securely stored.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
The authors declare no potential conflicts of interest for the research, authorship, and/or publication of this article.
Corresponding author
Ethics declarations
Conflict of interest
This research was conducted in accordance with the relevant ethical standards and guidelines.
Research involving human and/or animals
Not applicable.
Informed consent
The informed consent was obtained from all participants prior to their inclusion in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mechergui, A., Karaa, W.B.A. & Zghal, S. Cross-Domain Multilingual Clustering: A Generative Hybrid Model for Constructing and Enhancing Semantic Graphs from Heterogeneous Data. SN COMPUT. SCI. 5, 1066 (2024). https://doi.org/10.1007/s42979-024-03374-3
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s42979-024-03374-3