A novel approach for Web page modeling in personal information extraction

Yuliang, Wei; Qi, Zhou; Fang, Lv; Xixian, Han; Guodong, Xin; Bailing, Wang

doi:10.1007/s11280-018-0631-9

Published: 05 September 2018

Volume 22, pages 603–620, (2019)

World Wide Web

Wei Yuliang ORCID: orcid.org/0000-0001-5408-1430¹,
Zhou Qi¹,
Lv Fang¹,
Han Xixian¹,
Xin Guodong¹ &
…
Wang Bailing¹

Abstract

The goal of personal information extraction (PIE) is to extract content associated with a name from Web pages. Available Web page models, which are also widely used in content extraction and automatic wrapper algorithms, include the text model, the document object model, and the vision-based page segmentation model. Because existing models focus on Web structure rather than semantic relevance, they are difficult to use directly for PIE. To deal with this problem, we introduce the sequence block model (SBM), which makes it easy to determine the relevance of each page block to the retrieval name. We then define PIE in terms of the SBM. Exploiting the sequence correlation of the SBM, we design a 4-layer seq2seq deep learning network for PIE. Experimental results show that our new model extracts twice as much data as content extraction algorithms, and that the recall rate of the network is 7% higher than that of the traditional model with a classification algorithm.
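The abstract does not specify how the SBM is built, so the following is only a rough sketch of the general idea of treating a page as an ordered sequence of text blocks and labeling each block's relevance to a name. Everything here is an assumption made for illustration: the `BlockSequencer` class, the choice of block-level tags, the keyword-matching `label_blocks` function (a naive stand-in for the paper's seq2seq network, not a reimplementation of it), and the toy HTML and gold labels.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags that start or end a block; not taken from the paper.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class BlockSequencer(HTMLParser):
    """Collect visible text into an ordered list of blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0  # depth inside <script>/<style>

    def _flush(self):
        # Join buffered text, normalize whitespace, and keep non-empty blocks.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def page_to_blocks(html):
    """Return the page's text blocks in document order."""
    parser = BlockSequencer()
    parser.feed(html)
    parser.close()
    return parser.blocks


def label_blocks(blocks, name):
    """Naive baseline: label a block 1 if it mentions the name, else 0."""
    needle = name.lower()
    return [1 if needle in block.lower() else 0 for block in blocks]


def recall(predicted, gold):
    """Fraction of gold-relevant blocks that were predicted relevant."""
    true_pos = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    positives = sum(gold)
    return true_pos / positives if positives else 0.0


# Toy page and hand-made gold labels (both invented for this example).
html = (
    "<html><body><h1>Staff</h1>"
    "<p>Jane Doe is a professor.</p>"
    "<p>Contact the office.</p>"
    "<script>var x = 1;</script>"
    "<p>Email: jane@example.org</p>"
    "</body></html>"
)
blocks = page_to_blocks(html)
predicted = label_blocks(blocks, "Jane Doe")
gold = [0, 1, 0, 1]  # the e-mail block is relevant but never states the name
print(blocks)
print(predicted, recall(predicted, gold))
```

In this toy case the keyword baseline misses the e-mail block, which belongs to the person but never repeats the name. That is one plausible reason a model that uses the order of neighboring blocks could recover more relevant content than per-block matching.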

Acknowledgements

This work is partially supported by the National Key Research and Development Program of China (No. 2016YFB0800802) and the Shandong Key Research and Development Plan (grants No. 2016ZDJS01A04 and No. 2017CXGC0706).

Author information

Authors and Affiliations

Harbin Institute of Technology, Weihai, Shandong, People’s Republic of China
Wei Yuliang, Zhou Qi, Lv Fang, Han Xixian, Xin Guodong & Wang Bailing

Corresponding author

Correspondence to Wang Bailing.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Deep vs. Shallow: Learning for Emerging Web-scale Data Computing and Applications

Guest Editors: Jingkuan Song, Shuqiang Jiang, Elisa Ricci, and Zi Huang

About this article

Cite this article

Yuliang, W., Qi, Z., Fang, L. et al. A novel approach for Web page modeling in personal information extraction. World Wide Web 22, 603–620 (2019). https://doi.org/10.1007/s11280-018-0631-9

Received: 13 August 2017
Revised: 09 July 2018
Accepted: 07 August 2018
Published: 05 September 2018
Issue Date: 15 March 2019
DOI: https://doi.org/10.1007/s11280-018-0631-9
