DCADE: divide and conquer alignment with dynamic encoding for full page data extraction

Applied Intelligence

Abstract

In this paper, we consider the problem of full schema induction from either multiple list pages or singleton pages that share the same template. Existing approaches do not work well for this problem because they use fixed abstraction schemes that suit data-rich section detection but are not appropriate for the small records and complex data found in other sections. We propose an unsupervised full-schema web data extraction method based on Divide-and-Conquer Alignment with Dynamic Encoding (DCADE for short). We define the Content Equivalence Class (CEC) and the Typeset Equivalence Class (TEC) based on leaf node content. We then combine HTML attributes (i.e., id and class) in the paths for various levels of encoding, so that the proposed algorithm can align leaf nodes by exploring patterns at various levels, from specific to general. We conducted experiments on 49 real-world websites used in TEX and ExAlg. DCADE achieved an F1 measure of 0.962 for non-recordset data extraction (denoted by FD) and 0.936 for recordset data extraction (denoted by FS), outperforming other page-level web data extraction methods: DCA (FD = 0.660), TEX (FD = 0.454, FS = 0.549), RoadRunner (FD = 0.396, FS = 0.330), and UWIDE (FD = 0.260, FS = 0.081).
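
To make the idea of encoding paths at several levels more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `encode_path`, the three levels, and the tuple representation of a path step are all assumptions chosen for illustration. The abstract only states that id and class attributes are combined in the paths to obtain encodings ranging from specific to general.

```python
from typing import List, Optional, Tuple

# One step of a root-to-leaf path: (tag, id attribute, class attribute).
# This representation is hypothetical, not taken from the paper.
Step = Tuple[str, Optional[str], Optional[str]]


def encode_path(path: List[Step], level: int) -> str:
    """Encode a root-to-leaf path at an assumed abstraction level.

    level 0: tag + id + class (most specific)
    level 1: tag + class
    level 2: tag only (most general)
    """
    parts = []
    for tag, id_, cls in path:
        part = tag
        if level == 0 and id_:
            part += f"#{id_}"      # id is kept only at the most specific level
        if level <= 1 and cls:
            part += f".{cls}"      # class is kept at levels 0 and 1
        parts.append(part)
    return "/".join(parts)


path = [("html", None, None), ("div", "main", "content"), ("span", None, "price")]
for level in range(3):
    print(level, encode_path(path, level))
```

Under this sketch, two leaf nodes whose paths differ only in an id value receive different codes at level 0 but the same code at levels 1 and 2, which is the kind of specific-to-general fallback the abstract describes.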

Notes

  1. If the r-th page does not contain any leaf node with CECId e, we assign -1 to e.FP[r].

  2. If there exist frequent patterns RT or RD with NSD = 0, we set θNSD to 0; otherwise, θNSD = 0.5.

  3. http://www.tdg-seville.info/Hassan/TEX

  4. The detailed extraction output of each approach can be found in the Appendix.
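
Notes 1 and 2 describe two small rules that can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, each page is modelled as a plain list of leaf-node CEC identifiers, and reading e.FP[r] as the position of the first matching leaf node on page r is an assumption, since the notes do not define it here.

```python
from typing import List, Sequence


def first_positions(pages: Sequence[Sequence[str]], cec_id: str) -> List[int]:
    """For each page, return the index of the first leaf node carrying cec_id.

    Following Note 1, pages with no such leaf node get -1.
    """
    fp = []
    for leaves in pages:
        try:
            fp.append(list(leaves).index(cec_id))
        except ValueError:
            fp.append(-1)  # the page has no leaf node with this CEC id
    return fp


def nsd_threshold(nsd_values: Sequence[float]) -> float:
    """Choose the NSD threshold as in Note 2.

    nsd_values holds the NSD of each frequent pattern (RT or RD).
    """
    return 0.0 if any(v == 0 for v in nsd_values) else 0.5


pages = [["a", "b", "c"], ["b", "c"], ["c", "a"]]
print(first_positions(pages, "a"))
print(nsd_threshold([0.3, 0.0]))
```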

References

  1. Arasu A, Garcia-Molina H (2003) Extracting structured data from web pages. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pp 337–348

  2. Bing L, Lam W, Wong TL (2013) Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning. In: Proceedings of the sixth ACM international conference on Web search and data mining, pp 567–576

  3. Bronzi M, Crescenzi V, Merialdo P, Papotti P (2013) Extraction and integration of partially overlapping web sources. Proc VLDB Endow 6(10):805–816

  4. Carlson A, Betteridge J, Wang RC, Hruschka R, Mitchell TM (2010) Coupled semi-supervised learning for information extraction. In: Proceedings of the third ACM international conference on Web search and data mining, pp 101–110

  5. Chang CH, Kayed M, Girgis MR, Shaalan KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng 18(10):1411–1428

  6. Chang CH, Chen TS, Chen MC, Ding JL (2016) Efficient page-level data extraction via schema induction and verification. In: Proceedings of the Pacific-Asia conference on knowledge discovery and data mining, pp 478–490

  7. Chu X, He Y, Chakrabarti K, Ganjam K (2015) Tegra: table extraction by global record alignment. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, pp 1713–1728

  8. Crescenzi V, Mecca G (2005) Automatic information extraction from large websites. Journal of the ACM (JACM) 51(5):731–779

  9. Crescenzi V, Merialdo P, Qiu D (2013) ALFRED: crowd assisted data extraction. In: Proceedings of the 22nd international conference on World Wide Web, pp 297–300

  10. Dhillon PS, Sellamanickam S, Selvaraj SK (2011) Semi-supervised multi-task learning of structured prediction models for web information extraction. In: Proceedings of the 20th ACM international conference on information and knowledge management, pp 957–966

  11. Ferrara E, De Meo P, Fiumara G, Baumgartner R (2014) Web data extraction, applications and techniques: a survey. Knowl-Based Syst 70:301–323

  12. Furche T, Gottlob G, Grasso G, Schallhart C, Sellers A (2013) OXPath: a language for scalable data extraction, automation, and crawling on the deep web. VLDB J 22(1):47–72

  13. Gulhane P, Madaan A, Mehta R, Ramamirtham J, Rastogi R, Satpal S, Sengamedu SH, Tengli A, Tiwari C (2011) Web-scale information extraction with vertex. In: Proceedings of the IEEE 27th international conference on data engineering, pp 1209–1220

  14. Gupta R, Sarawagi S (2011) Joint training for open-domain extraction on the web: exploiting overlap when supervision is limited. In: Proceedings of the fourth ACM international conference on Web search and data mining, pp 217–226

  15. Jiménez P, Corchuelo R (2016) On learning web information extraction rules with TANGO. Inf Syst J 66:74–103

  16. Kayed M, Chang CH (2010) Fivatech: page-level web data extraction from template pages. IEEE Trans Knowl Data Eng 22(2):249–263

  17. Lu Y, He H, Zhao H, Meng W, Yu C (2013) Annotating search results from web databases. IEEE Trans Knowl Data Eng 25(3):514–527

  18. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453

  19. Omari A, Kimelfeld B, Yahav E, Shoham S (2016) Lossless separation of web pages into layout code and data. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1805–1814

  20. Omari A, Shoham S, Yahav E (2017) Synthesis of forgiving data extractors. In: Proceedings of the tenth ACM international conference on web search and data mining, pp 385–394

  21. Ortona S, Orsi G, Furche T, Buoncristiano M (2016) Joint repairs for web wrappers. In: Proceedings of IEEE 32nd international conference on data engineering, pp 1146–1157

  22. Qu J, Ouyang D, Hua W, Ye Y, Zhou X (2019) Discovering correlations between sparse features in distant supervision for relation extraction. In: Proceedings of the twelfth ACM international conference on web search and data mining, pp 726–734

  23. Ratner AJ, Bach SH, Ehrenberg HR, Ré C (2017) Snorkel: fast training set generation for information extraction. In: Proceedings of the 2017 ACM international conference on management of data, pp 1683–1686

  24. Shi S, Liu C, Shen Y, Yuan C, Huang Y (2015) AutoRM: an effective approach for automatic Web data record mining. Knowl-Based Syst 89:314–331

  25. Sleiman HA, Corchuelo R (2013) Tex: an efficient and effective unsupervised web information extractor. Knowl-Based Syst 39:109–123

  26. Sleiman HA, Corchuelo R (2014) Trinity: on using trinary trees for unsupervised web data extraction. IEEE Trans Knowl Data Eng 26(6):1544–1556

  27. Song X, Liu J, Cao Y, Lin CY, Hon HW (2010) Automatic extraction of web data records containing user-generated content. In: Proceedings of the 19th ACM international conference on information and knowledge management, pp 39–48

  28. Su W, Wang J, Lochovsky FH, Liu Y (2012) Combining tag and value similarity for data extraction and alignment. IEEE Trans Knowl Data Eng 24(7):1186–1200

  29. Furche T, Gottlob G, Grasso G, Guo X, Orsi G, Schallhart C, Wang C (2014) DIADEM: thousands of websites to a single database. In: Proceedings of the VLDB, vol 7, pp 1845–1856

  30. Xie X, Fang Y, Zhang Z, Li L (2012) Extracting data records from web using suffix tree. In: Proceedings of the ACM SIGKDD workshop on mining data semantics, p 12

  31. Yuliana OY, Chang CH (2018) A novel alignment algorithm for effective web data extraction from singleton-item pages. Appl Intell 48(11):4355–4370

  32. Zhai Y, Liu B (2006) Structured data extraction from the web based on partial tree alignment. IEEE Trans Knowl Data Eng 18(12):1614–1628

  33. Zhao C, Zhang R, Qi J (2018) Web page template and data separation for better maintainability. In: Proceedings of international conference on web information systems engineering, pp 439–449

Acknowledgements

This research was supported by the Ministry of Science and Technology, Taiwan, under Grant MOST105-2628-E-008-004-MY2.

Author information

Corresponding author

Correspondence to Chia-Hui Chang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

The magenta numbers in brackets indicate the difference from the ground truth.

Table 4 Extraction output comparison on TableA
Table 5 Extraction output comparison on recordsets

About this article

Cite this article

Yuliana, O.Y., Chang, CH. DCADE: divide and conquer alignment with dynamic encoding for full page data extraction. Appl Intell 50, 271–295 (2020). https://doi.org/10.1007/s10489-019-01499-0

  • DOI: https://doi.org/10.1007/s10489-019-01499-0
