DCADE: divide and conquer alignment with dynamic encoding for full page data extraction

Yuliana, Oviliani Yenty; Chang, Chia-Hui

doi:10.1007/s10489-019-01499-0

Published: 22 July 2019

Applied Intelligence, volume 50, pages 271–295 (2020)

Abstract

In this paper, we consider the problem of full schema induction from either multiple list pages or singleton pages that share the same template. Existing approaches do not work well for this problem because they rely on fixed abstraction schemes: these suit the detection of data-rich sections but are not appropriate for the small records and complex data found in other sections. We propose an unsupervised, full-schema web data extraction method based on Divide-and-Conquer Alignment with Dynamic Encoding (DCADE for short). We define the Content Equivalence Class (CEC) and the Typeset Equivalence Class (TEC) based on leaf node content. We then combine HTML attributes (i.e., id and class) in the paths to obtain encodings at various levels, so that the proposed algorithm can align leaf nodes by exploring patterns from specific to general. We conducted experiments on 49 real-world websites used in TEX and ExAlg. DCADE achieved an F1 measure of 0.962 for non-recordset data extraction (denoted F_D) and 0.936 for recordset data extraction (denoted F_S), outperforming other page-level web data extraction methods: DCA (F_D = 0.660), TEX (F_D = 0.454, F_S = 0.549), RoadRunner (F_D = 0.396, F_S = 0.330), and UWIDE (F_D = 0.260, F_S = 0.081).
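The idea of encoding leaf-node paths at several levels, from specific to general, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Step` type, the three levels, and the `#id` / `.class` token syntax are assumptions chosen for illustration. The abstract states only that id and class attributes are combined in the paths.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Step:
    """One element on the root-to-leaf path (hypothetical representation)."""
    tag: str
    elem_id: str = ""
    cls: str = ""


def encode_path(path: List[Step], level: int) -> str:
    """Encode a path at a given level of specificity.

    Assumed levels (illustrative only):
      2 = tag + id + class (most specific)
      1 = tag + class
      0 = tag only (most general)
    """
    tokens = []
    for step in path:
        token = step.tag
        if level >= 2 and step.elem_id:
            token += "#" + step.elem_id
        if level >= 1 and step.cls:
            token += "." + step.cls
        tokens.append(token)
    return "/".join(tokens)


def group_by_encoding(paths: List[List[Step]], level: int) -> Dict[str, List[int]]:
    """Group leaf indices whose paths share the same encoding at this level."""
    groups: Dict[str, List[int]] = {}
    for index, path in enumerate(paths):
        groups.setdefault(encode_path(path, level), []).append(index)
    return groups


if __name__ == "__main__":
    leaves = [
        [Step("html"), Step("body"), Step("div", "main", "content"), Step("span", cls="price")],
        [Step("html"), Step("body"), Step("div", "side", "content"), Step("span", cls="price")],
    ]
    # Try the most specific encoding first, then relax it step by step.
    for level in (2, 1, 0):
        print(level, group_by_encoding(leaves, level))
```

In this sketch the two leaves stay separate at the most specific level because their `div` ids differ, and fall into one group once the id is dropped. That is the kind of relaxation that lets an aligner look for patterns at progressively more general levels.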

Notes

If the r-th page does not contain any leaf node with CECId e, we assign -1 to e.FP[r].
If there exist frequent patterns RT or RD with NSD = 0, we set 𝜃_NSD to 0; otherwise 𝜃_NSD = 0.5.
http://www.tdg-seville.info/Hassan/TEX
The detailed extraction output of each approach can be found in the Appendix.
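The first two notes describe simple rules that can be sketched in code. The sketch below is illustrative only and rests on assumptions not stated on this page: that `FP[r]` holds the position of the first leaf node on page r carrying a given CEC id, that each page is represented as a list of CEC ids (one per leaf node), and that the NSD value of each frequent pattern is already available as a number. The function names are hypothetical.

```python
from typing import Dict, List


def first_positions(pages: List[List[str]], cec_id: str) -> List[int]:
    """Return one entry per page: the index of the first leaf with cec_id,
    or -1 when the page contains no such leaf (as in the first note)."""
    fp = []
    for leaves in pages:
        fp.append(leaves.index(cec_id) if cec_id in leaves else -1)
    return fp


def nsd_threshold(pattern_nsd: Dict[str, float]) -> float:
    """Choose the NSD threshold as in the second note: 0 if any frequent
    pattern has NSD = 0, otherwise 0.5."""
    return 0.0 if any(v == 0 for v in pattern_nsd.values()) else 0.5


if __name__ == "__main__":
    pages = [["a", "b"], ["c"], ["b", "a"]]
    print(first_positions(pages, "a"))
    print(nsd_threshold({"RT": 0.0, "RD": 0.3}))
```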

References

Arasu A, Garcia-Molina H (2003) Extracting structured data from web pages. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pp 337–348
Bing L, Lam W, Wong TL (2013) Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning. In: Proceedings of the sixth ACM international conference on Web search and data mining, pp 567–576
Bronzi M, Crescenzi V, Merialdo P, Papotti P (2013) Extraction and integration of partially overlapping web sources. VLDB J 6(10):805–816
Carlson A, Betteridge J, Wang RC, Hruschka R, Mitchell TM (2010) Coupled semi-supervised learning for information extraction. In: Proceedings of the third ACM international conference on Web search and data mining, pp 101–110
Chang CH, Kayed M, Girgis MR, Shaalan KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng 18(10):1411–1428
Chang CH, Chen TS, Chen MC, Ding JL (2016) Efficient page-level data extraction via schema induction and verification. In: Proceedings of the Pacific-Asia conference on knowledge discovery and data mining, pp 478–490
Chu X, He Y, Chakrabarti K, Ganjam K (2015) Tegra: table extraction by global record alignment. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, pp 1713–1728
Crescenzi V, Mecca G (2005) Automatic information extraction from large websites. Journal of the ACM (JACM) 51(5):731–779
Crescenzi V, Merialdo P, Qiu D (2013) Alfred: crowd assisted data extraction. In: Proceedings of the 22nd international conference on World Wide Web, pp 297–300
Dhillon PS, Sellamanickam S, Selvaraj SK (2011) Semi-supervised multi-task learning of structured prediction models for web information extraction. In: Proceedings of the 20th ACM international conference on information and knowledge management, pp 957–966
Ferrara E, De Meo P, Fiumara G, Baumgartner R (2014) Web data extraction, applications and techniques: a survey. Knowl-Based Syst 70:301–323
Furche T, Gottlob G, Grasso G, Schallhart C, Sellers A (2013) OXPath: a language for scalable data extraction, automation, and crawling on the deep web. The International Journal on Very Large Data Bases 22(1):47–72
Gulhane P, Madaan A, Mehta R, Ramamirtham J, Rastogi R, Satpal S, Sengamedu SH, Tengli A, Tiwari C (2011) Web-scale information extraction with vertex. In: Proceedings of the IEEE 27th international conference on data engineering, pp 1209–1220
Gupta R, Sarawagi S (2011) Joint training for open-domain extraction on the web: exploiting overlap when supervision is limited. In: Proceedings of the fourth ACM international conference on Web search and data mining, pp 217–226
Jiménez P, Corchuelo R (2016) On learning web information extraction rules with TANGO. Inf Syst J 66:74–103
Kayed M, Chang CH (2010) Fivatech: page-level web data extraction from template pages. IEEE Trans Knowl Data Eng 22(2):249–263
Lu Y, He H, Zhao H, Meng W, Yu C (2013) Annotating search results from web databases. IEEE Trans Knowl Data Eng 25(3):514–527
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453
Omari A, Kimelfeld B, Yahav E, Shoham S (2016) Lossless separation of web pages into layout code and data. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1805–1814
Omari A, Shoham S, Yahav E (2017) Synthesis of forgiving data extractors. In: Proceedings of the tenth ACM international conference on web search and data mining, pp 385–394
Ortona S, Orsi G, Furche T, Buoncristiano M (2016) Joint repairs for web wrappers. In: Proceedings of IEEE 32nd international conference on data engineering, pp 1146–1157
Qu J, Ouyang D, Hua W, Ye Y, Zhou X (2019) Discovering correlations between sparse features in distant supervision for relation extraction. In: Proceedings of the twelfth ACM international conference on web search and data mining, pp 726–734
Ratner AJ, Bach SH, Ehrenberg HR, Ré C (2017) Snorkel: fast training set generation for information extraction. In: Proceedings of the 2017 ACM international conference on management of data, pp 1683–1686
Shi S, Liu C, Shen Y, Yuan C, Huang Y (2015) AutoRM: an effective approach for automatic Web data record mining. Knowl-Based Syst 89:314–331
Sleiman HA, Corchuelo R (2013) Tex: an efficient and effective unsupervised web information extractor. Knowl-Based Syst 39:109–123
Sleiman HA, Corchuelo R (2014) Trinity: on using trinary trees for unsupervised web data extraction. IEEE Trans Knowl Data Eng 26(6):1544–1556
Song X, Liu J, Cao Y, Lin CY, Hon HW (2010) Automatic extraction of web data records containing user-generated content. In: Proceedings of the 19th ACM international conference on information and knowledge management, pp 39–48
Su W, Wang J, Lochovsky FH, Liu Y (2012) Combining tag and value similarity for data extraction and alignment. IEEE Trans Knowl Data Eng 24(7):1186–1200
Furche T, Gottlob G, Grasso G, Guo X, Orsi G, Schallhart C, Wang C (2014) DIADEM: thousands of websites to a single database. In: Proceedings of the VLDB, vol 7, pp 1845–1856
Xie X, Fang Y, Zhang Z, Li L (2012) Extracting data records from web using suffix tree. In: Proceedings of the ACM SIGKDD workshop on mining data semantics, p 12
Yuliana OY, Chang CH (2018) A novel alignment algorithm for effective web data extraction from singleton-item pages. Appl Intell 48(11):4355–4370
Zhai Y, Liu B (2006) Structured data extraction from the web based on partial tree alignment. IEEE Trans Knowl Data Eng 18(12):1614–1628
Zhao C, Zhang R, Qi J (2018) Web page template and data separation for better maintainability. In: Proceedings of international conference on web information systems engineering, pp 439–449

Acknowledgements

This research was supported by the Ministry of Science and Technology, Taiwan, under Grant MOST105-2628-E-008-004-MY2.

Author information

Authors and Affiliations

CSIE, National Central University, Taoyuan, 32001, Taiwan
Oviliani Yenty Yuliana & Chia-Hui Chang

Corresponding author

Correspondence to Chia-Hui Chang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

The magenta numbers in brackets represent the gap with the ground truth answer.

Table 4 Extraction output comparison on TableA

Table 5 Extraction output comparison on recordsets

About this article

Cite this article

Yuliana, O.Y., Chang, CH. DCADE: divide and conquer alignment with dynamic encoding for full page data extraction. Appl Intell 50, 271–295 (2020). https://doi.org/10.1007/s10489-019-01499-0

Issue Date: February 2020
DOI: https://doi.org/10.1007/s10489-019-01499-0
