Abstract
Page-level data extraction offers a complete solution for all kinds of information requirements; however, few studies address the task because of its difficulty and complexity. Moreover, previous page-level systems focus on achieving unsupervised data extraction and pay less attention to schema/wrapper generation and verification. In this paper, we emphasize the importance of schema verification for large-scale extraction tasks. Given a large number of web pages for data extraction, the system uses part of the input pages to train the schema without supervision and then extracts data from the remaining pages through schema verification. To speed up processing, we use the leaf nodes of the DOM trees as the processing units and dynamically adjust the encoding for better alignment. The proposed system outperforms other page-level extraction systems in terms of schema correctness and extraction efficiency. Overall, extraction is 2.7 times faster than state-of-the-art unsupervised approaches that extract data page by page without schema verification.
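The train-then-verify workflow described above can be illustrated with a minimal sketch. It is a toy stand-in, not the system proposed in the paper: it assumes a rigid template in which every page yields the same sequence of leaf nodes, and it performs no alignment or encoding adjustment. The names LeafCollector, leaf_nodes, induce_schema, and verify_and_extract are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LeafCollector(HTMLParser):
    """Collect (tag_path, text) pairs for the text leaf nodes of a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.leaves = []  # (tag path, text) per non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def leaf_nodes(page):
    parser = LeafCollector()
    parser.feed(page)
    parser.close()
    return parser.leaves


def induce_schema(training_pages):
    """Assumption: all training pages share one leaf-path sequence.
    A slot whose text is identical on every page is treated as template
    text; any other slot is treated as a data field (marked None)."""
    leaf_lists = [leaf_nodes(p) for p in training_pages]
    first = leaf_lists[0]
    paths = [path for path, _ in first]
    for leaves in leaf_lists[1:]:
        if [path for path, _ in leaves] != paths:
            raise ValueError("training pages do not share one leaf-path sequence")
    schema = []
    for i, (path, text) in enumerate(first):
        is_fixed = all(leaves[i][1] == text for leaves in leaf_lists)
        schema.append((path, text if is_fixed else None))
    return schema


def verify_and_extract(schema, page):
    """Return the data fields if the page conforms to the schema, else None."""
    leaves = leaf_nodes(page)
    if len(leaves) != len(schema):
        return None
    record = []
    for (path, text), (s_path, s_text) in zip(leaves, schema):
        if path != s_path:
            return None
        if s_text is None:
            record.append(text)   # data slot: extract the value
        elif text != s_text:
            return None           # template text changed: verification fails
    return record
```

In this sketch, a page that passes verification is handled with a single linear pass over its leaf nodes, while a page that fails returns None and would need to be handled separately, for example by retraining. Optional and repeated fields, which a real page-level schema must handle, are deliberately out of scope here.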
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Chang, CH., Chen, TS., Chen, MC., Ding, JL. (2016). Efficient Page-Level Data Extraction via Schema Induction and Verification. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_38
DOI: https://doi.org/10.1007/978-3-319-31750-2_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31749-6
Online ISBN: 978-3-319-31750-2
eBook Packages: Computer Science (R0)