Abstract
With the rapid development of technology, the Web has become the largest encyclopedic database. Although users can conveniently obtain information on the surface web with applications such as browsers, information in the deep web is hard to retrieve. The deep web requires a user to submit a query to a server, which then draws on its database to generate the result webpage. Data extraction in the deep web therefore needs methods different from traditional Web surfing. Most existing deep web data extraction methods are based on DOM tree analysis. In this paper, to fully utilize the visual information contained in a webpage, we propose a data region locating method based on a convolutional neural network and a segmentation algorithm based on visual information. To verify the efficiency of the proposed method, we apply it to real-world commercial websites to perform data extraction. Experiments on the data region location model, data extraction, and data item alignment show that the proposed method can effectively improve the accuracy of data region location and the efficiency of data extraction.
Acknowledgements
This work was supported by the Shanghai Maritime University research fund project (20130469), the Shanghai Science and Technology Innovation Plan Fund (14511107400), and the State Oceanic Administration China research fund project (201305026). It was also supported by the open research fund of the Key Lab of Broadband Wireless Communication and Sensor Network Technology, Nanjing University of Posts and Telecommunications, Ministry of Education.
About this article
Cite this article
Liu, J., Lin, L., Cai, Z. et al. Deep web data extraction based on visual information processing. J Ambient Intell Human Comput 15, 1481–1491 (2024). https://doi.org/10.1007/s12652-017-0587-0
DOI: https://doi.org/10.1007/s12652-017-0587-0