Abstract
Personal information extraction (PIE) aims to extract content associated with a name from Web pages. Existing Web page models, which are also widely used in content extraction and automatic wrapper algorithms, include the text model, the document object model, and the vision-based page segmentation model. Because these models focus on Web structure rather than semantic relevance, they are difficult to apply directly to PIE. To address this problem, we introduce the sequence block model (SBM), with which it is easy to determine the relevance of each page block to the name being retrieved. We then define PIE based on the SBM. Exploiting the sequence correlation of the SBM, we design a 4-layer seq2seq deep learning network for PIE. Experimental results show that our new model extracts twice as much data as content extraction algorithms, and that the recall rate of the network is 7% higher than that of the traditional model with a classification algorithm.
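To make the idea of a page as a sequence of blocks concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how blocks are segmented or scored, so the tag set `BLOCK_TAGS`, the helpers `page_to_blocks` and `label_blocks`, and the substring-matching labeler are all hypothetical stand-ins. The labeler is a naive baseline, not the 4-layer seq2seq network.

```python
from html.parser import HTMLParser

# Hypothetical block boundaries: the abstract does not give the SBM
# segmentation rules, so this tag set is an assumption for illustration.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "section", "article"}


class BlockSequencer(HTMLParser):
    """Collect page text into an ordered list of blocks (document order)."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished blocks
        self._buffer = []   # text fragments of the block being built

    def _flush(self):
        # Join fragments and collapse runs of whitespace.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def page_to_blocks(html):
    """Return the page's text blocks as a sequence."""
    parser = BlockSequencer()
    parser.feed(html)
    parser.close()
    return parser.blocks


def label_blocks(blocks, name):
    """Naive stand-in labeler: 1 if the block mentions the name, else 0."""
    needle = name.lower()
    return [1 if needle in block.lower() else 0 for block in blocks]


if __name__ == "__main__":
    html = (
        "<div><p>Jane Doe is a professor.</p>"
        "<p>Contact us</p>"
        "<p>Email: jane@example.org</p></div>"
    )
    blocks = page_to_blocks(html)
    for block, label in zip(blocks, label_blocks(blocks, "Jane Doe")):
        print(label, block)
```

In this toy page the e-mail block belongs to the person but never repeats the name, so per-block matching misses it. Judging relevance from neighboring blocks in the sequence is the kind of dependency a sequence model could capture, though the sketch itself does not attempt that. Script and style text is not filtered out here.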
Acknowledgements
This work is partially supported by the National Key Research and Development Program of China (No. 2016YFB0800802) and the Shandong Key Research and Development Plan (grants No. 2016ZDJS01A04 and No. 2017CXGC0706).
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Special Issue on Deep vs. Shallow: Learning for Emerging Web-scale Data Computing and Applications
Guest Editors: Jingkuan Song, Shuqiang Jiang, Elisa Ricci, and Zi Huang
About this article
Cite this article
Yuliang, W., Qi, Z., Fang, L. et al. A novel approach for Web page modeling in personal information extraction. World Wide Web 22, 603–620 (2019). https://doi.org/10.1007/s11280-018-0631-9
DOI: https://doi.org/10.1007/s11280-018-0631-9