An Optimal Topic Centric Crawler for Acquiring Bio-medical Themes Utilizing Gaussian Support Vector Regression

  • Original Research
  • SN Computer Science

Abstract

A focused crawler (FC) is a web crawler that downloads only the web pages relevant to a given topic. The Internet is now the main source of biomedical information. However, the volume, pace, variety, and quality of online biomedical information make it difficult to crawl and call for improved retrieval methods. To retrieve precise biomedical information, a search engine needs an effective, targeted crawling mechanism. To address these challenges, a new FC is proposed that uses Gaussian support vector regression to calculate the importance of a web page. Computing synonyms of the topic term with the Unified Medical Language System (UMLS), a popular biomedical ontology, helps the proposed crawler improve the performance of its relevance computation module. After 5000 webpage crawls on biomedical topics, the newly designed crawler outperforms existing crawlers, with an average harvest rate \(h_{\text{rate}}\) of 0.37 and an average irrelevance ratio \(p_{\text{rate}}\) of 0.63. Experimental results show that the proposed FC improves the performance of focused crawling for biomedical topics.
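The two reported metrics can be illustrated with a minimal sketch. It assumes the common definitions: harvest rate is the fraction of crawled pages judged relevant, and irrelevance ratio is its complement, which is consistent with the reported 0.37 and 0.63 summing to 1. The function names and the crawl log below are hypothetical and are not taken from the article; the log is constructed only so that the arithmetic matches the reported averages.

```python
from typing import Sequence


def harvest_rate(relevant_flags: Sequence[bool]) -> float:
    """Fraction of crawled pages judged relevant to the topic (assumed definition)."""
    if not relevant_flags:
        raise ValueError("no pages crawled")
    return sum(relevant_flags) / len(relevant_flags)


def irrelevance_ratio(relevant_flags: Sequence[bool]) -> float:
    """Fraction of crawled pages judged irrelevant, assumed to be 1 - harvest rate."""
    return 1.0 - harvest_rate(relevant_flags)


# Hypothetical crawl log (not the authors' data): 5000 pages, 1850 judged relevant.
crawl_log = [True] * 1850 + [False] * 3150

print(f"h_rate = {harvest_rate(crawl_log):.2f}")
print(f"p_rate = {irrelevance_ratio(crawl_log):.2f}")
```

Under these assumptions, a crawler that keeps 1850 relevant pages out of 5000 has a harvest rate of 0.37, so the two figures in the abstract describe the same split from opposite sides.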

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Funding

No funding was received for this research.

Author information

Corresponding author

Correspondence to C. Navaneethan.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances in Computational Approaches for Image Processing, Wireless Networks, Cloud Applications and Network Security” guest edited by P. Raviraj, Maode Ma and Roopashree H R.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rajiv, S., Navaneethan, C. An Optimal Topic Centric Crawler for Acquiring Bio-medical Themes Utilizing Gaussian Support Vector Regression. SN COMPUT. SCI. 4, 838 (2023). https://doi.org/10.1007/s42979-023-02306-x

  • DOI: https://doi.org/10.1007/s42979-023-02306-x
