A novel alignment algorithm for effective web data extraction from singleton-item pages

Applied Intelligence

Abstract

Automatic data extraction from template pages is an essential task for data integration and data analysis. Most research focuses on data extraction from list pages. The problem of data alignment for singleton-item pages (singleton pages for short), which contain detailed information about a single item, is less studied and more challenging because the number of data attributes to be aligned is much larger than in list pages. In this paper, we propose a novel alignment algorithm that works on the leaf nodes of the DOM trees of input pages to extract data from singleton pages. The idea is to detect mandatory templates via the longest increasing sequence of landmark equivalence-class leaf nodes and to recursively apply the same procedure to each segment delimited by the mandatory templates. With this divide-and-conquer approach, we can efficiently conduct local alignment for each segment while effectively handling multi-order attribute-value pairs with a two-pass procedure. The results show that the proposed approach, called Divide-and-Conquer Alignment (DCA), outperforms TEX (Sleiman and Corchuelo 2013) and WEIR (Bronzi et al. VLDB 6(10):805–816 2013) by 2% and 12% on selected items of the TEX and WEIR datasets, respectively. The improvement is more pronounced in full-schema evaluation, with an F-measure of 0.95 (DCA) versus 0.63 (TEX) on 26 websites from TEX and EXALG (Arasu and Molina 2003).
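
The divide-and-conquer idea in the abstract can be illustrated with a small sketch. This is not the authors' DCA implementation: it is a simplified two-page version written under several assumptions. Each page is taken to be a flat list of leaf-node texts, a "landmark" is assumed to be any text that occurs exactly once in both segments, and the function names `longest_increasing_subsequence` and `align` are hypothetical. The two-pass handling of multi-order attribute-value pairs is not modelled.

```python
from bisect import bisect_left
from collections import Counter


def longest_increasing_subsequence(values):
    """Return indices of one longest strictly increasing subsequence of `values`."""
    tails = []        # tails[k]: index of the smallest tail of a run of length k + 1
    tail_values = []  # values at those indices, kept sorted for bisect
    prev = [-1] * len(values)
    for i, v in enumerate(values):
        k = bisect_left(tail_values, v)
        if k == len(tails):
            tails.append(i)
            tail_values.append(v)
        else:
            tails[k] = i
            tail_values[k] = v
        prev[i] = tails[k - 1] if k > 0 else -1
    # Walk back from the tail of the longest run to recover the indices.
    out = []
    i = tails[-1] if tails else -1
    while i != -1:
        out.append(i)
        i = prev[i]
    return out[::-1]


def align(page_a, page_b):
    """Align two leaf sequences; returns a list of (kind, a_part, b_part) tuples."""
    count_a, count_b = Counter(page_a), Counter(page_b)
    # Assumption: a landmark is a leaf text occurring exactly once in both segments.
    pos_b = {t: j for j, t in enumerate(page_b) if count_b[t] == 1}
    cands = [(i, pos_b[t]) for i, t in enumerate(page_a)
             if count_a[t] == 1 and t in pos_b]
    if not cands:
        # No landmark left: treat whatever remains as one unaligned data slot.
        return [("data", page_a, page_b)] if (page_a or page_b) else []
    # Keep the largest set of landmarks that appear in the same order on both pages.
    keep = longest_increasing_subsequence([j for _, j in cands])
    result, start_a, start_b = [], 0, 0
    for k in keep:
        i, j = cands[k]
        # Recurse on the segment that precedes this landmark (divide and conquer).
        result += align(page_a[start_a:i], page_b[start_b:j])
        result.append(("template", [page_a[i]], [page_b[j]]))
        start_a, start_b = i + 1, j + 1
    result += align(page_a[start_a:], page_b[start_b:])
    return result


if __name__ == "__main__":
    a = ["Title:", "Dune", "Author:", "Frank Herbert", "Price:", "$9.99"]
    b = ["Title:", "Emma", "Author:", "Jane Austen", "Price:", "$7.50"]
    for row in align(a, b):
        print(row)
```

On this toy input the three labels are reported as templates and the values between them as data slots. Recounting occurrences inside each segment is what lets a text that repeats across the whole page still serve as a landmark locally, which is the reason for recursing rather than aligning once.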

Notes

  1. http://nekohtml.sourceforge.net CyberNeko HTML Parser, accessed 10 January 2017

  2. DecorativeTag ≡ {a, b, big, cite, dfn, font, em, i, mark, small, span, sub, sup, strike, u, strong}

  3. http://www.dia.uniroma3.it/db/weir

  4. http://www.tdg-seville.info/Hassan/TEX

  5. http://infolab.stanford.edu/arvind/extract/

References

  1. Arasu A, Molina HG (2003) Extracting structured data from Web pages. In: SIGMOD, pp 337–348

  2. Augenstein I, Maynard D, Ciravegna F (2016) Distantly supervised web relation extraction for knowledge base population. Semantic Web 7(4):335–349

  3. Bing L, Lam W, Wong TL (2013) Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning. In: Web search and data mining, pp 567–576

  4. Bronzi M, Crescenzi V, Merialdo P, Papotti P (2013) Extraction and integration of partially overlapping web sources. VLDB 6(10):805–816

  5. Chang CH, Kayed M, Girgis MR, Shaalan KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng 18(10):1411–1428

  6. Chu X, He Y, Chakrabarti K, Ganjam K (2015) Tegra: table extraction by global record alignment. In: SIGMOD, pp 1713–1728

  7. Cortez E, da Silva AS, Gonçalves MA, de Moura ES (2010) Ondux: on-demand unsupervised learning for information extraction. In: SIGMOD, pp 807–818

  8. Cortez E, Oliveira D, da Silva AS et al. (2011) Joint unsupervised structure discovery and information extraction. In: SIGMOD, pp 541–552

  9. Crescenzi V, Mecca G (2004) Automatic information extraction from large websites. J ACM 51(5):731–779

  10. Crescenzi V, Merialdo P, Qiu D (2013) Alfred: crowd assisted data extraction. In: WWW, pp 297–300

  11. Dalvi BB, Cohen WW, Callan J (2012) Websets: extracting sets of entities from the web using unsupervised information extraction. In: Web search and data mining, pp 243–252

  12. Dhillon PS, Sellamanickam S, Selvaraj SK (2011) Semi-supervised multi-task learning of structured prediction models for web information extraction. In: Information and knowledge management, pp 957–966

  13. Ferrara E, De Meo P, Fiumara G, Baumgartner R (2014) Web data extraction, applications and techniques: a survey. Knowl-Based Syst 70:301–323

  14. Fossati M, Dorigatti E, Giuliano C (2017) N-ary relation extraction for simultaneous T-box and A-box knowledge base augmentation. Semantic Web, 1–27

  15. Hao Q, Cai R, Pang Y, Zhang L (2011) From one tree to a forest: a unified solution for structured web data extraction. In: SIGIR, pp 775–784

  16. He B, Patel M, Zhang Z, Chang KCC (2007) Accessing the deep web. Commun ACM 50(5):94–101

  17. Ibrahim Y, Riedewald M, Weikum G (2016) Making sense of entities and quantities in web tables. In: CIKM, pp 1703–1712

  18. Jou C (2015) Semantics-assisted deep web query interface classification. In: Computer science & software engineering, pp 70–78

  19. Kayed M, Chang CH (2010) FiVaTech: page-level web data extraction from template pages. IEEE Trans Knowl Data Eng 22(2):249–263

  20. Lu Y, He H, Zhao H et al. (2013) Annotating search results from web databases. IEEE Trans Knowl Data Eng 25(3):514–527

  21. Sarawagi S, Chakrabarti S (2014) Open-domain quantity queries on web tables: annotation, response, and consensus models. In: SIGKDD, pp 711–720

  22. Sequeda JF, Arenas M, Miranker DP (2012) On directly mapping relational databases to RDF and OWL. In: WWW, pp 649–658

  23. Sleiman HA, Corchuelo R (2013) TEX: an efficient and effective unsupervised web information extractor. Knowl-Based Syst 39:109–123

  24. Su W, Wang J, Lochovsky FH, Liu Y (2012) Combining tag and value similarity for data extraction and alignment. IEEE Trans Knowl Data Eng 24(7):1186–1200

  25. Vieira K, da Costa Carvalho AL, Berlt K et al. (2009) On finding templates on web collections. WWW J 12(2):171–211

  26. Weninger T, Hsu WH, Han J (2010) CETR: content extraction via tag ratios. In: WWW, pp 971–980

  27. Wu S, Liu J, Fan J (2015) Automatic web content extraction by combination of learning and grouping. In: WWW. ACM, pp 1264–1274

  28. Yuliana OY, Chang CH (2016) AFIS: aligning detail-pages for full schema induction. In: TAAI, pp 220–227

  29. Zhai Y, Liu B (2006) Structured data extraction from the web based on partial tree alignment. IEEE Trans Knowl Data Eng 18(12):1614–1628

  30. Zheng X, Gu Y, Li Y (2012) Data extraction from web pages based on structural-semantic entropy. In: WWW, pp 93–102

Acknowledgements

This research is supported by the Ministry of Science and Technology, Taiwan, under grant MOST 105-2628-E-008-004-MY2.

Author information

Corresponding author

Correspondence to Chia-Hui Chang.

Additional information

This article is an extended version of the TAAI 2016 paper [28].

About this article

Cite this article

Yuliana, O.Y., Chang, CH. A novel alignment algorithm for effective web data extraction from singleton-item pages. Appl Intell 48, 4355–4370 (2018). https://doi.org/10.1007/s10489-018-1208-0

  • DOI: https://doi.org/10.1007/s10489-018-1208-0
