Abstract
For centuries, text has been the simplest and most effective way to store human knowledge. As technology has advanced, the volume of text has grown ever larger, and extracting useful information from it has become an exceptionally complex task. To address this problem, we present a pipeline that extracts core knowledge from large quantities of text using distributed computing. The components of our pipeline are systems that are known to yield good results. The outputs of the pipeline are stored in a knowledge graph, a graph that stores knowledge in the form of triples (head, relation, tail). Existing knowledge graphs include the Google Knowledge Graph, YAGO, DBLP, and DBpedia. These knowledge graphs have one thing in common: they are in English. English is studied by many researchers around the world and has become a rich-resource language, with many natural language processing tools and data sets. Vietnamese, on the other hand, is a low-resource language. We therefore use a cross-lingual transfer method to build a Vietnamese knowledge graph. First, we collect text data about Vietnam tourism, written mostly in Vietnamese, using Google Search and Wikipedia. Next, we translate the text into English with Google Translate and use English natural language processing tools such as the Stanford Parser, coreference resolution, ClausIE, and MinIE to extract useful triples from it. Lastly, the triples are translated back into Vietnamese to build a Vietnam tourism knowledge graph. Because we work with massive text, we develop a distributed algorithm to extract triples from its sentences. This is a distributed version of MinIE, which was originally developed for a single-machine model. In the Apache Spark framework, we divide the massive text into many smaller parts and move them to worker nodes that run the distributed MinIE function. Each worker node extracts triples from the sentences of its local text, and the worker nodes run in parallel. Finally, the results from the worker nodes are sent back to the master node to build the knowledge graph. We conduct experiments with the distributed MinIE on a Spark cluster to demonstrate the performance advantage of our proposed algorithm.
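The partition, extract, and collect pattern described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, with a thread pool standing in for Spark worker nodes, and `extract_triples` is a hypothetical toy extractor that stands in for MinIE. All function names here are assumptions made for illustration. In PySpark, the per-partition step would roughly correspond to a function passed to `RDD.mapPartitions`.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def extract_triples(sentence: str) -> List[Triple]:
    """Hypothetical stand-in for MinIE: treat the first word as the head,
    the second as the relation, and the rest of the sentence as the tail."""
    words = sentence.strip().rstrip(".").split()
    if len(words) < 3:
        return []  # too short to form a (head, relation, tail) triple
    return [(words[0], words[1], " ".join(words[2:]))]


def partition(sentences: List[str], num_parts: int) -> List[List[str]]:
    """Split the corpus into num_parts chunks, one per simulated worker."""
    return [sentences[i::num_parts] for i in range(num_parts)]


def extract_partition(chunk: List[str]) -> List[Triple]:
    """Work done on one worker: extract triples from every local sentence."""
    return [t for s in chunk for t in extract_triples(s)]


def build_graph(triples: List[Triple]) -> Dict[str, Set[Tuple[str, str]]]:
    """Master-side step: index triples as head -> {(relation, tail)}."""
    graph: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for head, relation, tail in triples:
        graph[head].add((relation, tail))
    return dict(graph)


def run_pipeline(sentences: List[str], num_workers: int = 2):
    """Partition the text, extract triples in parallel, then merge the
    per-worker results into a single knowledge graph."""
    chunks = partition(sentences, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(extract_partition, chunks))
    all_triples = [t for part in results for t in part]
    return build_graph(all_triples)
```

Calling `run_pipeline(["Hanoi is the capital of Vietnam.", "Hue has many historic monuments."])` returns a dictionary keyed by head entity. A real deployment would replace the toy extractor with MinIE and the thread pool with a Spark cluster, and would add the translation steps before and after extraction.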
Acknowledgements
This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCMC) under grant number DS2020-26-01.
Contributions
All authors contributed equally.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Do, P., Phan, T., Le, H. et al. Building a knowledge graph by using cross-lingual transfer method and distributed MinIE algorithm on Apache Spark. Neural Comput & Applic 34, 8393–8409 (2022). https://doi.org/10.1007/s00521-020-05495-1
DOI: https://doi.org/10.1007/s00521-020-05495-1