Abstract
A focused crawler (FC) is a web crawler that downloads only web pages relevant to a given topic. The Internet is now the main source of biomedical information. However, the volume, pace, variety, and quality of online biomedical information make it difficult to crawl and call for improved retrieval methods. To retrieve precise biomedical information, a search engine needs an effective, targeted crawling mechanism. To address these challenges, a new FC is proposed that uses Gaussian support vector regression to calculate the importance of a web page. Computing synonyms of the topic terms with the Unified Medical Language System (UMLS), a popular biomedical ontology, improves the performance of the crawler's relevance computation module. After 5000 web page crawls on biomedical topics, the proposed crawler outperforms existing crawlers, with an average harvest rate \((h_{\text{rate}})\) of 0.37 and an average irrelevance ratio \((p_{\text{rate}})\) of 0.63. The experimental results show that the proposed FC improves focused-crawling performance for biomedical topics.
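The two reported metrics can be illustrated with a short sketch. It assumes the usual definitions, which the abstract does not spell out: the harvest rate is the fraction of crawled pages judged relevant, and the irrelevance ratio is the fraction judged irrelevant, so the two sum to 1 (consistent with the reported 0.37 and 0.63). The function name `crawl_metrics` and the page counts are hypothetical, chosen only to reproduce the reported averages; they are not the authors' data or code.

```python
def crawl_metrics(relevance_flags):
    """Return (harvest_rate, irrelevance_ratio) for per-page relevance judgements.

    Assumed definitions (not stated in the abstract):
      harvest_rate      = relevant pages / crawled pages
      irrelevance_ratio = irrelevant pages / crawled pages
    """
    flags = list(relevance_flags)
    total = len(flags)
    if total == 0:
        raise ValueError("no pages were crawled")
    relevant = sum(1 for flag in flags if flag)
    h_rate = relevant / total
    p_rate = (total - relevant) / total
    return h_rate, p_rate


# Hypothetical crawl of 5000 pages, 1850 of which are judged relevant.
flags = [True] * 1850 + [False] * 3150
h_rate, p_rate = crawl_metrics(flags)
print(round(h_rate, 2), round(p_rate, 2))  # prints: 0.37 0.63
```

Under these definitions a higher harvest rate and a lower irrelevance ratio both indicate a crawler that stays closer to the topic.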
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Funding
No funding was received for this research.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
This article is part of the topical collection “Advances in Computational Approaches for Image Processing, Wireless Networks, Cloud Applications and Network Security” guest edited by P. Raviraj, Maode Ma and Roopashree H R.
About this article
Cite this article
Rajiv, S., Navaneethan, C. An Optimal Topic Centric Crawler for Acquiring Bio-medical Themes Utilizing Gaussian Support Vector Regression. SN COMPUT. SCI. 4, 838 (2023). https://doi.org/10.1007/s42979-023-02306-x
DOI: https://doi.org/10.1007/s42979-023-02306-x