A semantic and intelligent focused crawler based on semantic vector space model and membrane computing optimization algorithm

Applied Intelligence

Abstract

A focused crawler downloads web pages related to a given topic from the Internet. In many research studies, most focused crawlers predict the priority of each unvisited hyperlink by combining topic similarities, computed with a text similarity model, with weighting factors that are set manually to equal values. However, the text similarity models used by these crawlers are flawed, and the weighting factors used to calculate the priorities of unvisited URLs are chosen arbitrarily. To solve these problems, this paper proposes a semantic and intelligent focused crawler based on the Semantic Vector Space Model (SVSM) and the Membrane Computing Optimization Algorithm (MCOA). First, the SVSM method is used to calculate the topic similarities between texts and the given topic. Second, the MCOA method is used to optimize four weighting factors based on the evolution rules and the communication rule. Finally, the proposed focused crawler predicts the priority of each unvisited hyperlink by integrating the topic similarities of four texts with the four optimized weighting factors. The experimental results indicate that the proposed SVSM-MCOA Crawler improves the evaluation indicators compared with the other four focused crawlers. In conclusion, the proposed SVSM and MCOA methods give the focused crawler semantic understanding and intelligent learning ability.
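
The priority prediction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the SVSM formula, name the four texts, or state how similarities and weights are integrated, so the sketch makes three assumptions: a plain term-frequency cosine similarity stands in for SVSM, the four similarities are combined as a weighted sum, and the weights are passed in by hand rather than optimized by MCOA. The names `cosine_similarity`, `link_priority`, and `Frontier` are hypothetical.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text, topic):
    """Term-frequency cosine similarity between a text and a topic.

    Assumption: this is an ordinary vector space model used as a
    stand-in; the paper's SVSM is not specified in the abstract.
    """
    a = Counter(text.lower().split())
    b = Counter(topic.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_priority(similarities, weights):
    """Combine four topic similarities with four weighting factors.

    Assumption: a weighted sum; the abstract only says the values
    are "integrated".
    """
    if len(similarities) != 4 or len(weights) != 4:
        raise ValueError("expected four similarities and four weights")
    return sum(w * s for w, s in zip(weights, similarities))


class Frontier:
    """Unvisited URLs ordered by descending priority."""

    def __init__(self):
        self._heap = []

    def push(self, url, priority):
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority


if __name__ == "__main__":
    topic = "focused web crawler"
    # Four hypothetical texts associated with one hyperlink.
    texts = [
        "a focused crawler for the web",
        "crawler download page",
        "contact us",
        "focused web crawler tutorial",
    ]
    sims = [cosine_similarity(t, topic) for t in texts]
    weights = [0.25, 0.25, 0.25, 0.25]  # equal weights, the manual baseline
    frontier = Frontier()
    frontier.push("http://example.com/a", link_priority(sims, weights))
    frontier.push("http://example.com/b", 0.1)
    print(frontier.pop())
```

With equal weights this reproduces the manual baseline that the abstract criticizes; an optimizer such as MCOA would replace the hand-set `weights` list with learned values.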

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 61872298), the Science and Technology Department of Sichuan Province (Grant No. 2021YFQ0008), the College Student Innovation and Entrepreneurship Training Project of Sichuan Province (Grant No. S202110650044) and the Education and Teaching Reform Research Project of Xihua University (Grant No. xjjg2019026).

Author information

Corresponding author

Correspondence to Yajun Du.

About this article

Cite this article

Liu, W., Gan, Z., Xi, T. et al. A semantic and intelligent focused crawler based on semantic vector space model and membrane computing optimization algorithm. Appl Intell 53, 7390–7407 (2023). https://doi.org/10.1007/s10489-022-03180-5

  • DOI: https://doi.org/10.1007/s10489-022-03180-5
