A novel alignment algorithm for effective web data extraction from singleton-item pages

Yuliana, Oviliani Yenty; Chang, Chia-Hui

doi:10.1007/s10489-018-1208-0

Published: 15 June 2018

Volume 48, pages 4355–4370, (2018)

Applied Intelligence

Abstract

Automatic data extraction from template pages is an essential task for data integration and data analysis. Most research focuses on data extraction from list pages. The problem of data alignment for singleton-item pages (singleton pages for short), which contain detailed information about a single item, is less addressed and more challenging because the number of data attributes to be aligned is much larger than in list pages. In this paper, we propose a novel alignment algorithm that works on leaf nodes from the DOM trees of input pages for singleton-page data extraction. The idea is to detect mandatory templates via the longest increasing sequence from the landmark equivalence class leaf nodes and to recursively apply the same procedure to each segment divided by mandatory templates. With this divide-and-conquer approach, we are able to efficiently conduct local alignment for each segment while effectively handling multi-order attribute-value pairs with a two-pass procedure. The results show that the proposed approach (called Divide-and-Conquer Alignment, DCA) outperforms TEX (Sleiman and Corchuelo 2013) and WEIR (Bronzi et al. VLDB 6(10):805–816 2013) by 2% and 12% on selected items of the TEX and WEIR datasets, respectively. The improvement is more pronounced in full schema evaluation, with an F-measure of 0.95 (DCA) versus 0.63 (TEX) on 26 websites from TEX and EXALG (Arasu and Molina 2003).
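
The divide-and-conquer idea in the abstract can be illustrated with a small sketch. This is not the authors' DCA implementation: it handles only two pages, it represents each page as a flat list of leaf-node strings, and it assumes that a "landmark" is any string occurring exactly once in both pages. The function names (`longest_increasing_subsequence`, `landmarks`, `align`) and the sample pages are hypothetical. Landmarks that keep the same relative order in both pages, found with a longest increasing subsequence, act as anchors, and the segments between anchors are aligned recursively.

```python
from bisect import bisect_left
from collections import Counter


def longest_increasing_subsequence(seq):
    """Return indices into seq of one longest strictly increasing subsequence."""
    tails = []      # tails[k]: smallest tail value of an increasing run of length k + 1
    tails_idx = []  # index in seq of that tail value
    prev = [-1] * len(seq)
    for i, x in enumerate(seq):
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
            tails_idx.append(i)
        else:
            tails[k] = x
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    out = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:  # walk the predecessor links back to the start
        out.append(i)
        i = prev[i]
    return out[::-1]


def landmarks(page_a, page_b):
    """Assumption: a landmark is a leaf string that occurs exactly once in each page."""
    ca, cb = Counter(page_a), Counter(page_b)
    return {t for t in ca if ca[t] == 1 and cb.get(t) == 1}


def align(page_a, page_b):
    """Split both pages at order-consistent landmarks and recurse on each segment."""
    marks = landmarks(page_a, page_b)
    if not marks:
        # No landmark left: the whole segment is aligned as one unit.
        return [(page_a, page_b)] if (page_a or page_b) else []
    pos_b = {t: j for j, t in enumerate(page_b) if t in marks}
    idx_a = [i for i, t in enumerate(page_a) if t in marks]
    seq = [pos_b[page_a[i]] for i in idx_a]
    anchors = [(idx_a[k], seq[k]) for k in longest_increasing_subsequence(seq)]
    result, start_a, start_b = [], 0, 0
    for ia, ib in anchors:
        result += align(page_a[start_a:ia], page_b[start_b:ib])
        result.append(([page_a[ia]], [page_b[ib]]))  # the anchor itself
        start_a, start_b = ia + 1, ib + 1
    result += align(page_a[start_a:], page_b[start_b:])
    return result


# Hypothetical leaf-node sequences from two detail pages of the same site.
page_a = ["Title:", "Dune", "Author:", "Herbert", "Price:", "$9"]
page_b = ["Title:", "Emma", "Author:", "Austen", "ISBN:", "123", "Price:", "$7"]
segments = align(page_a, page_b)
for seg_a, seg_b in segments:
    print(seg_a, "<->", seg_b)
```

In this toy run the labels `Title:`, `Author:`, and `Price:` become anchors, and the optional `ISBN:` pair in the second page falls into the segment between `Author:` and `Price:`. The two-pass handling of multi-order attribute-value pairs mentioned in the abstract is not modeled here.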

Notes

http://nekohtml.sourceforge.net CyberNeko HTML Parser, accessed 10 January 2017
DecorativeTag ≡{a,b,big,cite,dfn,font,em,i,mark,small, span,sub,sup,strike,u,strong}
http://www.dia.uniroma3.it/db/weir
http://www.tdg-seville.info/Hassan/TEX
http://infolab.stanford.edu/arvind/extract/

References

Arasu A, Molina HG (2003) Extracting structured data from Web pages. In: SIGMOD, pp 337–348
Augenstein I, Maynard D, Ciravegna F (2016) Distantly supervised web relation extraction for knowledge base population. Semantic Web 7(4):335–349
Bing L, Lam W, Wong TL (2013) Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning. In: Web search and data mining, pp 567–576
Bronzi M, Crescenzi V, Merialdo P, Papotti P (2013) Extraction and integration of partially overlapping web sources. VLDB 6(10):805–816
Chang CH, Kayed M, Girgis MR, Shaalan KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng 18(10):1411–1428
Chu X, He Y, Chakrabarti K, Ganjam K (2015) Tegra: table extraction by global record alignment. In: SIGMOD, pp 1713–1728
Cortez E, da Silva AS, Gonçalves MA, de Moura ES (2010) Ondux: on-demand unsupervised learning for information extraction. In: SIGMOD, pp 807–818
Cortez E, Oliveira D, da Silva AS et al. (2011) Joint unsupervised structure discovery and information extraction. In: SIGMOD, pp 541–552
Crescenzi V, Mecca G (2004) Automatic information extraction from large websites. J ACM 51(5):731–779
Crescenzi V, Merialdo P, Qiu D (2013) Alfred: crowd assisted data extraction. In: WWW, pp 297–300
Dalvi BB, Cohen WW, Callan J (2012) Websets: extracting sets of entities from the web using unsupervised information extraction. In: Web search and data mining, pp 243–252
Dhillon PS, Sellamanickam S, Selvaraj SK (2011) Semi-supervised multi-task learning of structured prediction models for web information extraction. In: Information and knowledge management, pp 957–966
Ferrara E, De Meo P, Fiumara G, Baumgartner R (2014) Web data extraction, applications and techniques: a survey. Knowl-Based Syst 70:301–323
Fossati M, Dorigatti E, Giuliano C (2017) N-ary relation extraction for simultaneous T-box and A-box knowledge base augmentation. Semantic Web, 1–27
Hao Q, Cai R, Pang Y, Zhang L (2011) From one tree to a forest: a unified solution for structured web data extraction. In: SIGIR, pp 775–784
He B, Patel M, Zhang Z, Chang KCC (2007) Accessing the deep web. Commun ACM 50(5):94–101
Ibrahim Y, Riedewald M, Weikum G (2016) Making sense of entities and quantities in web tables. In: CIKM, pp 1703–1712
Jou C (2015) Semantics-assisted deep web query interface classification. In: Computer science & software engineering, pp 70–78
Kayed M, Chang CH (2010) FiVaTech: page-level web data extraction from template pages. IEEE Trans Knowl Data Eng 22(2):249–263
Lu Y, He H, Zhao H et al. (2013) Annotating search results from web databases. IEEE Trans Knowl Data Eng 25(3):514–527
Sarawagi S, Chakrabarti S (2014) Open-domain quantity queries on web tables: annotation, response, and consensus models. In: SIGKDD, pp 711–720
Sequeda JF, Arenas M, Miranker DP (2012) On directly mapping relational databases to RDF and OWL. In: WWW, pp 649–658
Sleiman HA, Corchuelo R (2013) TEX: an efficient and effective unsupervised web information extractor. Knowl-Based Syst 39:109–123
Su W, Wang J, Lochovsky FH, Liu Y (2012) Combining tag and value similarity for data extraction and alignment. IEEE Trans Knowl Data Eng 24(7):1186–1200
Vieira K, da Costa Carvalho AL, Berlt K et al. (2009) On finding templates on web collections. WWW J 12(2):171–211
Weninger T, Hsu WH, Han J (2010) CETR: content extraction via tag ratios. In: WWW, pp 971–980
Wu S, Liu J, Fan J (2015) Automatic web content extraction by combination of learning and grouping. In: WWW. ACM, pp 1264–1274
Yuliana OY, Chang CH (2016) AFIS: aligning detail-pages for full schema induction. In: TAAI, pp 220–227
Zhai Y, Liu B (2006) Structured data extraction from the web based on partial tree alignment. IEEE Trans Knowl Data Eng 18(12):1614–1628
Zheng X, Gu Y, Li Y (2012) Data extraction from web pages based on structural-semantic entropy. In: WWW, pp 93–102

Acknowledgements

This research is supported by the Ministry of Science and Technology, Taiwan, under grant MOST 105-2628-E-008-004-MY2.

Author information

Authors and Affiliations

CSIE, National Central University, Taoyuan, 32001, Taiwan
Oviliani Yenty Yuliana & Chia-Hui Chang

Corresponding author

Correspondence to Chia-Hui Chang.

Additional information

This article is an extended version of the TAAI 2016 paper (Yuliana and Chang 2016) [28].

About this article

Cite this article

Yuliana, O.Y., Chang, CH. A novel alignment algorithm for effective web data extraction from singleton-item pages. Appl Intell 48, 4355–4370 (2018). https://doi.org/10.1007/s10489-018-1208-0

Issue Date: November 2018
DOI: https://doi.org/10.1007/s10489-018-1208-0
