Deep web data extraction based on visual information processing

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

With the rapid development of technology, the Web has become the largest encyclopedic database. Although users can conveniently obtain information on the surface web with applications such as browsers, information in the deep web is hard to retrieve. The deep web requires a user to submit a query to a server, which draws on its database to generate the result page. Extracting data from the deep web therefore requires methods different from traditional Web surfing. Most existing deep web data extraction methods are based on DOM tree analysis. In this paper, to make full use of the visual information contained in a webpage, we propose a data region locating method based on a convolutional neural network and a segmentation algorithm based on visual information. To verify the efficiency of the proposed method, we apply it to real-world commercial websites to perform data extraction. Experiments on the data region location model, data extraction, and data item alignment show that the proposed method can effectively improve the accuracy of data region location and the efficiency of data extraction.
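
The abstract does not spell out the segmentation algorithm, so the following is only a minimal sketch of the general idea of segmenting a located data region by visual cues rather than by DOM structure. It assumes the rendered blocks of the region are already available as bounding boxes, and it splits them into data records wherever a large vertical gap appears. The names `Box`, `segment_records`, and `gap_threshold` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """Rendered bounding box of one visual block (pixels, origin at top-left)."""
    x: int
    y: int
    w: int
    h: int
    text: str = ""


def segment_records(boxes: List[Box], gap_threshold: int = 12) -> List[List[Box]]:
    """Group boxes into records by splitting at large vertical gaps.

    Assumed heuristic (not the paper's algorithm): blocks of one record
    sit close together, while consecutive records are separated by a
    taller blank strip.
    """
    records: List[List[Box]] = []
    bottom: Optional[int] = None  # lowest edge seen in the current record
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        if bottom is None or box.y - bottom > gap_threshold:
            records.append([])  # gap is large: start a new record
            bottom = box.y + box.h
        else:
            bottom = max(bottom, box.y + box.h)
        records[-1].append(box)
    return records


# Illustrative layout: two result items, each with a title and a price.
boxes = [
    Box(0, 0, 200, 20, "Item A"),
    Box(0, 24, 80, 16, "$10"),
    Box(0, 70, 200, 20, "Item B"),
    Box(0, 94, 80, 16, "$12"),
]
records = segment_records(boxes)
print([[b.text for b in r] for r in records])
```

A fixed pixel threshold is the simplest possible choice; a real system would more plausibly derive the separator from the page itself, for example from the distribution of gaps within the region.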

Notes

  1. http://www.completeplanet.com.

  2. https://github.com/puzzledqs/BBox-Label-Tool.

Acknowledgements

This work was supported by the Shanghai Maritime University research fund project (20130469), the Shanghai Science and Technology Innovation Plan Fund (14511107400), and the State Oceanic Administration, China research fund project (201305026). It was also supported by the open research fund of the Key Lab of Broadband Wireless Communication and Sensor Network Technology, Nanjing University of Posts and Telecommunications, Ministry of Education.

Author information

Corresponding author

Correspondence to Hye-jin Kim.

Cite this article

Liu, J., Lin, L., Cai, Z. et al. Deep web data extraction based on visual information processing. J Ambient Intell Human Comput 15, 1481–1491 (2024). https://doi.org/10.1007/s12652-017-0587-0

  • DOI: https://doi.org/10.1007/s12652-017-0587-0
