A scalable parallel Chinese online encyclopedia knowledge denoising method based on entry tags and Spark cluster

Applied Intelligence

Abstract

Because online encyclopedias are open and collaboratively edited, a large number of knowledge triples in them are improperly classified, so open-domain encyclopedia knowledge bases (KBs) must be denoised and refined to improve their quality and precision. However, triple semantic features are often missing or inaccurate, which leads to poor refinement results. In addition, for large-scale encyclopedia KBs, processing massive amounts of knowledge takes too much computing time and the algorithms scale poorly. To solve the problems of knowledge denoising in the Chinese encyclopedia system, first, based on data field theory, this paper proposes a new Cartesian product mapping-based method (TripleES) to calculate the semantic similarity of entity triples, and on that basis a method for quantifying the quality of entry tags. Second, to further improve the denoising effect on KBs, this paper proposes a new method (TriplePV) to compute the potential value of a triple: a multi-feature fusion strategy calculates the semantic distance between “out-of-vocabulary” entry tags, and this distance is embedded into the potential function. Third, to ensure good scalability, the proposed denoising algorithms are implemented and optimized in parallel on the Spark cluster-computing framework. Specifically, Spark-based TripleES (ES_Spark) and Spark-based TriplePV (PV_Spark) algorithms are proposed to calculate the semantic similarity and the potential value of triples, respectively. Finally, the denoising effect and time efficiency are compared comprehensively with the state-of-the-art distributed Chinese encyclopedia knowledge denoising algorithm. Experimental results on real-world datasets show that the proposed parallel denoising algorithms improve both the efficiency of knowledge denoising and the accuracy of KBs, outperforming the state-of-the-art methods.
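The abstract names the two building blocks (a similarity computed over a Cartesian product of tags, and a potential value from data field theory) without giving their formulas. The following is only a minimal sketch of the general idea, not the TripleES or TriplePV definitions from the paper. It assumes the Gaussian-like potential function commonly used in data field theory, and it substitutes a hypothetical character-overlap similarity (`char_jaccard`) for whatever semantic similarity the authors actually use. All function names and parameters here are illustrative assumptions.

```python
import math
from itertools import product


def char_jaccard(a: str, b: str) -> float:
    """Hypothetical stand-in similarity: Jaccard overlap of character sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cartesian_similarity(tags_a, tags_b, sim=char_jaccard):
    """Average pairwise similarity over the Cartesian product of two tag lists.

    Assumption: a plain mean is used as the aggregate; the paper may weight
    or select pairs differently.
    """
    pairs = list(product(tags_a, tags_b))
    if not pairs:
        return 0.0
    return sum(sim(x, y) for x, y in pairs) / len(pairs)


def potential(distances, masses=None, sigma=1.0):
    """Gaussian-like data-field potential: sum_i m_i * exp(-(d_i / sigma)^2).

    distances: distance from the point of interest to each source object.
    masses:    per-object weights (defaults to 1.0 each, an assumption).
    sigma:     influence factor controlling how fast the field decays.
    """
    if masses is None:
        masses = [1.0] * len(distances)
    return sum(m * math.exp(-((d / sigma) ** 2))
               for m, d in zip(masses, distances))
```

In a Spark setting, functions like these would typically be applied per record inside a map-style transformation so that each triple is scored independently; that wiring is omitted here to keep the sketch free of cluster dependencies.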

Notes

  1. http://www.wikipedia.org

  2. http://www.baike.com

  3. http://baike.baidu.com

  4. http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm

Acknowledgements

This work was supported by the Scientific Research Project of Beijing Municipal Education Commission (General Social Science Project) [grant number SM201910038010]; Backup Academic Leaders Grant of Capital University of Economics and Business; and Special Fund of Fundamental Research Expenses of Beijing Municipal University of Capital University of Economics and Business.

Author information

Corresponding author

Correspondence to Ting Wang.

About this article

Cite this article

Wang, T., Li, J. & Guo, J. A scalable parallel Chinese online encyclopedia knowledge denoising method based on entry tags and Spark cluster. Appl Intell 51, 7573–7599 (2021). https://doi.org/10.1007/s10489-021-02295-5

  • DOI: https://doi.org/10.1007/s10489-021-02295-5
