Deep web data extraction based on visual information processing

Liu, Jin; Lin, Li; Cai, Zehuan; Wang, Jin; Kim, Hye-jin

doi:10.1007/s12652-017-0587-0

Original Research
Published: 12 October 2017

Volume 15, pages 1481–1491, (2024)
Journal of Ambient Intelligence and Humanized Computing

Jin Liu ORCID: orcid.org/0000-0001-7249-698X¹,
Li Lin¹,
Zehuan Cai¹,
Jin Wang²,³ &
…
Hye-jin Kim⁴

516 Accesses
10 Citations

Abstract

With the rapid development of technology, the Web has become the largest encyclopedic database. Although users can conveniently obtain information from the surface web with applications such as browsers, retrieving information from the deep web is much harder. The deep web requires a user to submit a query to a server, which draws on its database to generate the result webpage. Data extraction in the deep web therefore calls for methods that differ from traditional Web surfing. Most existing deep web data extraction methods are based on DOM tree analysis. In this paper, to make full use of the visual information contained in a webpage, we propose a data region locating method based on a convolutional neural network and a segmentation algorithm based on visual information. To verify the efficiency of the proposed method, we apply it to real-world commercial websites to perform data extraction. Experiments on the data region location model, data extraction, and data item alignment show that the proposed method effectively improves the accuracy of data region location and the efficiency of data extraction.

Acknowledgements

This work was supported by the Shanghai Maritime University research fund project (20130469), the Shanghai Science and Technology Innovation Plan Fund (14511107400), and the State Oceanic Administration China research fund project (201305026). It was also supported by the open research fund of the Key Lab of Broadband Wireless Communication and Sensor Network Technology, Nanjing University of Posts and Telecommunications, Ministry of Education.

Author information

Authors and Affiliations

College of Information Engineering, Shanghai Maritime University, Shanghai, China
Jin Liu, Li Lin & Zehuan Cai
Key Laboratory of Broadband Wireless Communication and Sensor Network Technology (Nanjing University of Posts and Telecommunications), Ministry of Education, Nanjing, China
Jin Wang
College of Information Engineering, Yangzhou University, Yangzhou, China
Jin Wang
Business Administration Research Institute, Sungshin W. University, Seoul, South Korea
Hye-jin Kim

Corresponding author

Correspondence to Hye-jin Kim.

About this article

Cite this article

Liu, J., Lin, L., Cai, Z. et al. Deep web data extraction based on visual information processing. J Ambient Intell Human Comput 15, 1481–1491 (2024). https://doi.org/10.1007/s12652-017-0587-0

Received: 23 June 2017
Accepted: 23 September 2017
Published: 12 October 2017
Issue Date: February 2024
DOI: https://doi.org/10.1007/s12652-017-0587-0
