Efficient Page-Level Data Extraction via Schema Induction and Verification

Chang, Chia-Hui; Chen, Tian-Sheng; Chen, Ming-Chuan; Ding, Jhung-Li

doi:10.1007/978-3-319-31750-2_38


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9652)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining


Abstract

Page-level data extraction offers a complete solution for all kinds of information requirements; however, few studies address this task because of its difficulty and complexity. Moreover, previous page-level systems focus on achieving unsupervised data extraction and pay less attention to schema/wrapper generation and verification. In this paper, we emphasize the importance of schema verification for large-scale extraction tasks. Given a large number of web pages for data extraction, the system uses part of the input pages to train the schema without supervision, and then extracts data from the remaining input pages through schema verification. To speed up processing, we use the leaf nodes of the DOM trees as the processing units and dynamically adjust the encoding for better alignment. The proposed system outperforms other page-level extraction systems in terms of schema correctness and extraction efficiency. Overall, extraction is 2.7 times faster than state-of-the-art unsupervised approaches that extract data page by page without schema verification.
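The train-then-verify workflow described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes that every training page yields a leaf sequence of the same length, so it needs no alignment and no dynamic encoding, and all names (`LeafCollector`, `leaf_sequence`, `induce_schema`, `extract`) are hypothetical. It only shows the general idea of inducing a schema from a few pages, using DOM leaf nodes as the processing units, and then extracting from further pages by checking them against that schema.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they never enter the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LeafCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text leaf."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.leaves = []  # (tag_path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def leaf_sequence(html):
    """Return the page as a flat sequence of text leaves."""
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    return parser.leaves


def induce_schema(pages):
    """Toy schema induction (assumption: leaf sequences line up one to one).

    A leaf whose text is identical on every training page is treated as
    template text; any other leaf becomes a data slot (marked None).
    """
    seqs = [leaf_sequence(p) for p in pages]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("this sketch needs equally long leaf sequences")
    schema = []
    for column in zip(*seqs):
        if len({path for path, _ in column}) != 1:
            raise ValueError("tag paths differ; real alignment would be needed")
        texts = {text for _, text in column}
        fixed = column[0][1] if len(texts) == 1 else None
        schema.append((column[0][0], fixed))
    return schema


def extract(schema, page):
    """Verify a page against the schema; return its data values or None."""
    leaves = leaf_sequence(page)
    if len(leaves) != len(schema):
        return None
    record = []
    for (path, fixed), (leaf_path, text) in zip(schema, leaves):
        if path != leaf_path or (fixed is not None and fixed != text):
            return None  # verification failed; the schema does not apply
        if fixed is None:
            record.append(text)
    return record


train = [
    "<html><body><h1>Book</h1><p>Title:</p><p>Dune</p></body></html>",
    "<html><body><h1>Book</h1><p>Title:</p><p>Emma</p></body></html>",
]
schema = induce_schema(train)
new_page = "<html><body><h1>Book</h1><p>Title:</p><p>Ulysses</p></body></html>"
print(extract(schema, new_page))
print(extract(schema, "<html><body><div>Other</div></body></html>"))
```

In this sketch, a page that passes verification is extracted with a single linear pass over its leaves, while a page that fails returns `None` and would have to be handled separately, for example by retraining the schema.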



Author information

Authors and Affiliations

CSIE, National Central University, Zhongli District, Taiwan
Chia-Hui Chang, Tian-Sheng Chen, Ming-Chuan Chen & Jhung-Li Ding

Corresponding author

Correspondence to Chia-Hui Chang.

Editor information

Editors and Affiliations

The University of Melbourne, Melbourne, Victoria, Australia
James Bailey
The University of Texas at Dallas, Richardson, Texas, USA
Latifur Khan
Osaka University, Osaka, Japan
Takashi Washio
University of Auckland, Auckland, New Zealand
Gill Dobbie
Shenzhen University, Shenzhen, China
Joshua Zhexue Huang
Massey University, Auckland, New Zealand
Ruili Wang

About this paper

Cite this paper

Chang, CH., Chen, TS., Chen, MC., Ding, JL. (2016). Efficient Page-Level Data Extraction via Schema Induction and Verification. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_38

DOI: https://doi.org/10.1007/978-3-319-31750-2_38
Published: 12 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31749-6
Online ISBN: 978-3-319-31750-2
eBook Packages: Computer Science, Computer Science (R0)
