Focused crawling strategies based on ontologies and simulated annealing methods for rainstorm disaster domain knowledge

基于本体和模拟退火算法的暴雨灾害主题爬虫策略

  • Research Article
Frontiers of Information Technology & Electronic Engineering

Abstract

Focused crawlers are a crucial means of obtaining effective domain knowledge from massive heterogeneous networks. Most current focused crawling technologies, however, struggle to produce high-quality crawling results. The main difficulties are establishing a topic benchmark model, assessing the topic relevance of hyperlinks, and designing the crawling strategy. In this paper, we use a domain ontology to build a topic benchmark model for a specific topic, and propose a novel multiple-filtering strategy based on local ontology and global ontology (MFSLG). A comprehensive priority evaluation method (CPEM) based on web text and link structure is introduced to improve the precision with which the topic relevance of unvisited hyperlinks is computed, and a simulated annealing (SA) method is used to keep the focused crawler from falling into local optima of the search. By incorporating SA into a focused crawler with MFSLG and CPEM for the first time, we propose two novel focused crawler strategies based on ontology and SA (FCOSA): FCOSA with only global ontology (FCOSA_G) and FCOSA with both local and global ontology (FCOSA_LG). Both are designed to obtain topic-relevant webpages about rainstorm disasters from the network. Experimental results show that the proposed crawlers outperform the other focused crawling strategies on different performance metrics.

摘要

目前, 主题爬虫是从海量异构网络中获取有效领域知识的重要方法. 目前大多数主题爬虫技术难以获得高质量爬行结果. 主要难点包括主题基准模型的建立、 超链接主题相关性的评估和爬行策略的设计. 本文采用领域本体为特定主题构建主题基准模型, 并提出一种新的基于局部本体和全局本体的多重筛选策略 (MFSLG). 为提高待访问超链接主题相关性计算精度, 提出一种基于网页文本和链接结构的综合优先度评估方法 (CPEM), 同时, 采用模拟退火 (SA) 算法避免主题爬虫陷入局部最优搜索. 本文首次设计融合SA算法、 MFSLG策略和CPEM策略实现主题爬虫, 提出两种新的基于本体和SA主题爬虫策略 (FCOSA), 包括基于全局本体的FCOSA策略 (FCOSA_G) 和基于局部本体和全局本体的FCOSA策略 (FCOSA_LG), 以从网络中获取与暴雨灾害主题相关的网页. 实验结果表明, 针对不同性能指标, 所提爬虫策略优于其他主题爬虫策略.
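To illustrate how simulated annealing can keep a crawler from always following the highest-priority link, the following is a minimal sketch, not the authors' implementation. It assumes each unvisited hyperlink already carries a priority score (the role CPEM plays in the paper, whose formula is not given in the abstract) and applies the generic Metropolis acceptance rule. The function names, the geometric cooling schedule, and the example URLs are all hypothetical.

```python
import math
import random


def sa_accept(greedy_priority, candidate_priority, temperature, rng=random):
    """Generic Metropolis rule (an assumption, not the paper's exact criterion):
    always accept a candidate that is at least as good as the greedy choice,
    and accept a worse one with probability exp(delta / temperature)."""
    delta = candidate_priority - greedy_priority
    if delta >= 0:
        return True
    if temperature <= 0:
        return False  # no randomness left: behave greedily
    return rng.random() < math.exp(delta / temperature)


def select_next_link(frontier, temperature, rng=random):
    """Choose the next URL from `frontier`, a dict mapping URL -> priority.

    The greedy choice is the highest-priority link; a randomly drawn
    candidate replaces it if the acceptance rule allows."""
    greedy_url = max(frontier, key=frontier.get)
    candidate_url = rng.choice(sorted(frontier))
    if sa_accept(frontier[greedy_url], frontier[candidate_url], temperature, rng):
        return candidate_url
    return greedy_url


def crawl_order(frontier, start_temperature=1.0, cooling_rate=0.9, rng=random):
    """Return the order in which links would be visited.

    The temperature is multiplied by `cooling_rate` after every visit
    (hypothetical geometric cooling), so selection becomes greedier over time."""
    remaining = dict(frontier)  # copy so the caller's frontier is untouched
    temperature = start_temperature
    order = []
    while remaining:
        url = select_next_link(remaining, temperature, rng)
        order.append(url)
        del remaining[url]
        temperature *= cooling_rate
    return order


if __name__ == "__main__":
    # Hypothetical URLs and priority scores, for illustration only.
    demo_frontier = {
        "https://example.org/rainstorm-warning": 0.9,
        "https://example.org/flood-report": 0.5,
        "https://example.org/sports-news": 0.1,
    }
    print(crawl_order(demo_frontier, rng=random.Random(0)))
```

At a high temperature the crawler frequently explores lower-priority links, and as the temperature falls it converges to plain best-first selection. With `start_temperature=0.0` the sketch reduces to a purely greedy crawler.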

Author information

Contributions

Jingfa LIU designed the research. Fan LI drafted the paper, implemented the software, and performed the experiments. Ruoyao DING and Zi’ang LIU revised and finalized the paper.

Corresponding authors

Correspondence to Jingfa Liu (刘景发) or Fan Li (李帆).

Ethics declarations

Jingfa LIU, Fan LI, Ruoyao DING, and Zi’ang LIU declare that they have no conflict of interest.

Additional information

Project supported by the Special Foundation of Guangzhou Key Laboratory of Multilingual Intelligent Processing, China (No. 201905010008), the Program of Science and Technology of Guangzhou, China (No. 202002030238), and the Guangdong Basic and Applied Basic Research Foundation, China (No. 2021A1515011974).

List of supplementary materials

Fig. S1 A global ontology structure about the topic of rainstorm disaster

Table S1 Seed URLs

Supplementary materials for

About this article

Cite this article

Liu, J., Li, F., Ding, R. et al. Focused crawling strategies based on ontologies and simulated annealing methods for rainstorm disaster domain knowledge. Front Inform Technol Electron Eng 23, 1189–1204 (2022). https://doi.org/10.1631/FITEE.2100360

  • DOI: https://doi.org/10.1631/FITEE.2100360
