Abstract
To address the incomplete topic description and repeated crawling of visited hyperlinks in traditional focused crawling methods, we propose a novel focused crawler using an improved tabu search algorithm with domain ontology and host information (FCITS_OH). A domain ontology is constructed by formal concept analysis to describe topics at the semantic and knowledge levels. To avoid crawling visited hyperlinks and to expand the search range, we present an improved tabu search (ITS) algorithm and a host information memory strategy. In addition, a comprehensive priority evaluation method based on Web text and link structure is designed to improve the assessment of topic relevance for unvisited hyperlinks. Experimental results in both the tourism and rainstorm disaster domains show that the proposed focused crawlers outperform traditional focused crawlers on different performance metrics.
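Abstract (translated from the Chinese version)
To address the incomplete topic description and repeated crawling of visited links in traditional focused crawling methods, this paper proposes a new focused crawling method (FCITS_OH) based on an improved tabu search algorithm that incorporates ontology and host information. The method constructs a domain ontology by formal concept analysis (FCA) to describe topics at the semantic and knowledge levels. To avoid repeatedly crawling visited links and to expand the search range, an improved tabu search (ITS) algorithm and a host information memory strategy are proposed. In addition, to improve the evaluation of topic relevance for unvisited links, a comprehensive priority evaluation method based on Web text and link structure is proposed. Experimental results on the tourism and rainstorm disaster topics show that, across different performance metrics, the proposed crawling method outperforms other focused crawling strategies in the literature.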
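The abstract names three ingredients: a tabu memory that keeps visited hyperlinks from being crawled again, a memory of host information, and a priority that combines Web text with link structure. The following Python sketch illustrates how such pieces could fit together in a crawl frontier. It is not the FCITS_OH or ITS algorithm from the paper: the class `TabuFrontier`, the function `comprehensive_priority`, the parameters `tabu_size`, `host_penalty`, and `alpha`, the linear weighting, and the per-host penalty are all hypothetical choices made only for illustration.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


def comprehensive_priority(text_score, link_score, alpha=0.5):
    """Hypothetical linear blend of a text-based and a link-based score."""
    return alpha * text_score + (1.0 - alpha) * link_score


class TabuFrontier:
    """Priority frontier with a fixed-length tabu list and per-host visit counts.

    Illustrative only: the penalty on frequently visited hosts is an
    assumption, not the host information strategy described in the paper.
    """

    def __init__(self, tabu_size=1000, host_penalty=0.1):
        if tabu_size < 1:
            raise ValueError("tabu_size must be at least 1")
        self._heap = []                       # entries: (-priority, order, url)
        self._order = 0                       # tie-breaker keeps FIFO order
        self._tabu = deque(maxlen=tabu_size)  # most recently visited URLs
        self._tabu_set = set()                # same URLs, for O(1) lookup
        self.host_visits = {}                 # host -> number of visited pages
        self.host_penalty = host_penalty

    def push(self, url, relevance):
        """Queue a URL unless it is tabu; return True if it was queued."""
        if url in self._tabu_set:
            return False
        host = urlsplit(url).netloc
        # The penalty is applied once, at push time, from the current count.
        priority = relevance - self.host_penalty * self.host_visits.get(host, 0)
        heapq.heappush(self._heap, (-priority, self._order, url))
        self._order += 1
        return True

    def pop(self):
        """Return the highest-priority non-tabu URL, or None if empty."""
        while self._heap:
            _, _, url = heapq.heappop(self._heap)
            if url in self._tabu_set:
                continue  # queued twice and already visited
            self._mark_visited(url)
            return url
        return None

    def _mark_visited(self, url):
        # When the tabu list is full, the oldest URL expires and may be
        # queued again later.
        if len(self._tabu) == self._tabu.maxlen:
            self._tabu_set.discard(self._tabu[0])
        self._tabu.append(url)
        self._tabu_set.add(url)
        host = urlsplit(url).netloc
        self.host_visits[host] = self.host_visits.get(host, 0) + 1
```

In this sketch a URL leaves the tabu list after `tabu_size` further visits, so the memory stays bounded, and the per-host counter nudges the frontier toward hosts that have been visited less often. Both behaviours are design assumptions for the example rather than details taken from the paper.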
Data availability
Data are available in a public repository.
Author information
Contributions
Jingfa LIU designed the research. Zhen WANG drafted the paper, implemented the software, and performed the experiments. Guo ZHONG and Zhihe YANG revised and finalized the paper.
Ethics declarations
Jingfa LIU, Zhen WANG, Guo ZHONG, and Zhihe YANG declare that they have no conflict of interest.
Additional information
Project supported by the Guangdong Basic and Applied Basic Research Foundation of China (Nos. 2021A1515011974 and 2023A1515011344) and the Program of Science and Technology of Guangzhou, China (No. 202002030238).
List of supplementary materials
Table S1 Seed uniform resource locators (URLs) in the tourism domain
Table S2 Seed uniform resource locators (URLs) in the rainstorm disaster domain
About this article
Cite this article
Liu, J., Wang, Z., Zhong, G. et al. A new focused crawler using an improved tabu search algorithm incorporating ontology and host information. Front Inform Technol Electron Eng 24, 859–875 (2023). https://doi.org/10.1631/FITEE.2200315
DOI: https://doi.org/10.1631/FITEE.2200315