A new focused crawler using an improved tabu search algorithm incorporating ontology and host information

A new focused crawling method using an improved tabu search algorithm that integrates ontology and host information

  • Research Article
Frontiers of Information Technology & Electronic Engineering

Abstract

To address the incomplete topic description and repeated crawling of visited hyperlinks in traditional focused crawling methods, we propose a novel focused crawler using an improved tabu search algorithm with domain ontology and host information (FCITS_OH). A domain ontology is constructed by formal concept analysis to describe topics at the semantic and knowledge levels. To avoid crawling visited hyperlinks and to expand the search range, we present an improved tabu search (ITS) algorithm and a host information memory strategy. In addition, a comprehensive priority evaluation method based on Web text and link structure is designed to improve the assessment of topic relevance for unvisited hyperlinks. Experimental results on both the tourism and rainstorm disaster domains show that the proposed focused crawlers outperform traditional focused crawlers on different performance metrics.

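Abstract (translated from the Chinese)

To address the incomplete topic description and repeated crawling of visited links in traditional focused crawling methods, this paper proposes a new focused crawling method (FCITS_OH) that uses an improved tabu search algorithm incorporating ontology and host information. The method constructs a domain ontology based on formal concept analysis (FCA) to describe topics at the semantic and knowledge levels. To avoid repeatedly crawling visited links and to expand the search range, an improved tabu search (ITS) algorithm and a host information memory strategy are proposed. In addition, to improve the evaluation of topic relevance for unvisited links, a comprehensive priority evaluation method based on Web text and link structure is proposed. Experimental results on the tourism and rainstorm disaster topics show that the proposed crawling method outperforms other focused crawling strategies in the literature on different performance metrics.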
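
The abstract gives no algorithmic detail, so the following Python sketch is only one possible reading of how a tabu list, a host memory, and a combined text/link priority might fit together in a crawl frontier. It is not the FCITS_OH implementation. The class name `TabuFrontier`, the weights `text_weight` and `host_penalty`, the fixed-size FIFO tabu list, and the linear scoring rule are all assumptions made for illustration; the relevance scores are taken as given inputs rather than computed from an ontology.

```python
import heapq
from collections import Counter, deque
from urllib.parse import urlsplit


class TabuFrontier:
    """Toy crawl frontier (hypothetical, not the paper's algorithm)."""

    def __init__(self, tabu_size=1000, host_penalty=0.1, text_weight=0.6):
        self.tabu = deque()            # visited URLs, oldest first (assumed FIFO tenure)
        self.tabu_size = tabu_size
        self.tabu_set = set()          # same URLs, for O(1) membership tests
        self.host_visits = Counter()   # host memory: pages fetched per host
        self.host_penalty = host_penalty
        self.text_weight = text_weight
        self.heap = []                 # min-heap of (-priority, url)
        self.queued = set()

    def priority(self, url, text_score, link_score):
        # Assumed linear mix of text relevance and link-structure relevance,
        # reduced for hosts that have already been visited often.
        mixed = self.text_weight * text_score + (1 - self.text_weight) * link_score
        host = urlsplit(url).netloc
        return mixed - self.host_penalty * self.host_visits[host]

    def push(self, url, text_score, link_score):
        """Queue an unvisited URL; return False if it is tabu or already queued."""
        if url in self.tabu_set or url in self.queued:
            return False
        # Priority is fixed at push time; later host visits do not update it.
        score = self.priority(url, text_score, link_score)
        heapq.heappush(self.heap, (-score, url))
        self.queued.add(url)
        return True

    def pop(self):
        """Return the highest-priority URL and mark it visited, or None if empty."""
        if not self.heap:
            return None
        _, url = heapq.heappop(self.heap)
        self.queued.discard(url)
        if len(self.tabu) >= self.tabu_size:
            # Tabu tenure expired for the oldest URL.
            self.tabu_set.discard(self.tabu.popleft())
        self.tabu.append(url)
        self.tabu_set.add(url)
        self.host_visits[urlsplit(url).netloc] += 1
        return url
```

In this sketch, a URL popped from the frontier cannot be queued again until it leaves the tabu list, and each visit to a host lowers the priority of links from that host queued afterwards, which nudges the crawl toward other hosts. Both behaviours are design guesses based on the abstract's stated goals of avoiding revisits and expanding the search range.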

Data availability

Data are available in a public repository.

Author information

Contributions

Jingfa LIU designed the research. Zhen WANG drafted the paper, implemented the software, and performed the experiments. Guo ZHONG and Zhihe YANG revised and finalized the paper.

Corresponding author

Correspondence to Zhen Wang (王震).

Ethics declarations

Jingfa LIU, Zhen WANG, Guo ZHONG, and Zhihe YANG declare that they have no conflict of interest.

Additional information

Project supported by the Guangdong Basic and Applied Basic Research Foundation of China (Nos. 2021A1515011974 and 2023A1515011344) and the Program of Science and Technology of Guangzhou, China (No. 202002030238).

List of supplementary materials

Table S1 Seed uniform resource locators (URLs) in the tourism domain

Table S2 Seed uniform resource locators (URLs) in the rainstorm disaster domain

About this article

Cite this article

Liu, J., Wang, Z., Zhong, G. et al. A new focused crawler using an improved tabu search algorithm incorporating ontology and host information. Front Inform Technol Electron Eng 24, 859–875 (2023). https://doi.org/10.1631/FITEE.2200315

  • DOI: https://doi.org/10.1631/FITEE.2200315
