Abstract
Automatic data extraction from template pages is an essential task for data integration and data analysis. Most research focuses on data extraction from list pages. The problem of data alignment for singleton-item pages (singleton pages for short), which contain detailed information about a single item, is less addressed and more challenging because the number of data attributes to be aligned is much larger than in list pages. In this paper, we propose a novel alignment algorithm that works on leaf nodes from the DOM trees of input pages for singleton-page data extraction. The idea is to detect mandatory templates via the longest increasing sequence from the landmark equivalence class leaf nodes and recursively apply the same procedure to each segment divided by mandatory templates. With this divide-and-conquer approach, we are able to efficiently conduct local alignment for each segment while effectively handling multi-order attribute-value pairs with a two-pass procedure. The results show that the proposed approach (called Divide-and-Conquer Alignment, DCA) outperforms TEX (Sleiman and Corchuelo 2013) and WEIR (Bronzi et al. VLDB 6(10):805–816 2013) by 2% and 12% on selected items of the TEX and WEIR datasets, respectively. The improvement is more pronounced in full schema evaluation, with an F-measure of 0.95 (DCA) versus 0.63 (TEX) on 26 websites from TEX and EXALG (Arasu and Molina 2003).
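The divide-and-conquer idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' DCA implementation. It makes several simplifying assumptions: only two pages are aligned, each page is already flattened into a list of leaf-node strings, and a "landmark" is approximated as any string that occurs exactly once in both segments. The function names (`longest_increasing_subsequence`, `align`, `_align`) are hypothetical, and the two-pass handling of multi-order attribute-value pairs is left out.

```python
from bisect import bisect_left
from collections import Counter


def longest_increasing_subsequence(values):
    """Return the indices of one longest strictly increasing subsequence."""
    tails = []       # tails[k]: index of the smallest tail of a run of length k + 1
    tail_vals = []   # values at those indices, kept sorted for bisect
    prev = [-1] * len(values)
    for i, v in enumerate(values):
        k = bisect_left(tail_vals, v)
        if k == len(tails):
            tails.append(i)
            tail_vals.append(v)
        else:
            tails[k] = i
            tail_vals[k] = v
        prev[i] = tails[k - 1] if k > 0 else -1
    # Walk the predecessor links back from the end of the longest run.
    out = []
    i = tails[-1] if tails else -1
    while i != -1:
        out.append(i)
        i = prev[i]
    return out[::-1]


def _align(a, a_lo, a_hi, b, b_lo, b_hi):
    seg_a, seg_b = a[a_lo:a_hi], b[b_lo:b_hi]
    count_a, count_b = Counter(seg_a), Counter(seg_b)
    # Assumption: a landmark is a string that is unique in both segments.
    pos_b = {t: b_lo + j for j, t in enumerate(seg_b) if count_b[t] == 1}
    cands = [(a_lo + i, pos_b[t]) for i, t in enumerate(seg_a)
             if count_a[t] == 1 and t in pos_b]
    if not cands:
        return []
    # Keep the largest set of landmarks that appear in the same order in both pages.
    keep = longest_increasing_subsequence([j for _, j in cands])
    anchors = [cands[k] for k in keep]
    result = []
    prev_a, prev_b = a_lo, b_lo
    for i, j in anchors:
        # Recurse into the segment between the previous anchor and this one.
        result += _align(a, prev_a, i, b, prev_b, j)
        result.append((i, j))
        prev_a, prev_b = i + 1, j + 1
    result += _align(a, prev_a, a_hi, b, prev_b, b_hi)
    return result


def align(page_a, page_b):
    """Return (index_in_a, index_in_b) pairs of aligned leaf nodes."""
    return _align(page_a, 0, len(page_a), page_b, 0, len(page_b))


page_a = ["Title:", "Dune", "Author:", "Herbert", "Price:", "$9"]
page_b = ["Title:", "Emma", "Price:", "$7"]
print(align(page_a, page_b))
```

For the two toy pages above, the sketch aligns `Title:` and `Price:` and leaves the optional `Author:` field and the data values unaligned, which is the kind of segmentation a later local alignment step would work on.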
Notes
http://nekohtml.sourceforge.net CyberNeko HTML Parser, accessed 10 January 2017
DecorativeTag ≡ {a, b, big, cite, dfn, font, em, i, mark, small, span, sub, sup, strike, u, strong}
References
Arasu A, Molina HG (2003) Extracting structured data from Web pages. In: SIGMOD, pp 337–348
Augenstein I, Maynard D, Ciravegna F (2016) Distantly supervised web relation extraction for knowledge base population. Semantic Web 7(4):335–349
Bing L, Lam W, Wong TL (2013) Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning. In: Web search and data mining, pp 567–576
Bronzi M, Crescenzi V, Merialdo P, Papotti P (2013) Extraction and integration of partially overlapping web sources. VLDB 6(10):805–816
Chang CH, Kayed M, Girgis MR, Shaalan KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng 18(10):1411–1428
Chu X, He Y, Chakrabarti K, Ganjam K (2015) Tegra: table extraction by global record alignment. In: SIGMOD, pp 1713–1728
Cortez E, da Silva AS, Gonçalves MA, de Moura ES (2010) Ondux: on-demand unsupervised learning for information extraction. In: SIGMOD, pp 807–818
Cortez E, Oliveira D, da Silva AS et al. (2011) Joint unsupervised structure discovery and information extraction. In: SIGMOD, pp 541–552
Crescenzi V, Mecca G (2004) Automatic information extraction from large websites. J ACM 51(5):731–779
Crescenzi V, Merialdo P, Qiu D (2013) Alfred: crowd assisted data extraction. In: WWW, pp 297–300
Dalvi BB, Cohen WW, Callan J (2012) Websets: extracting sets of entities from the web using unsupervised information extraction. In: Web search and data mining, pp 243–252
Dhillon PS, Sellamanickam S, Selvaraj SK (2011) Semi-supervised multi-task learning of structured prediction models for web information extraction. In: Information and knowledge management, pp 957–966
Ferrara E, De Meo P, Fiumara G, Baumgartner R (2014) Web data extraction, applications and techniques: a survey. Knowl-Based Syst 70:301–323
Fossati M, Dorigatti E, Giuliano C (2017) N-ary relation extraction for simultaneous T-box and A-box knowledge base augmentation. Semantic Web, 1–27
Hao Q, Cai R, Pang Y, Zhang L (2011) From one tree to a forest: a unified solution for structured web data extraction. In: SIGIR, pp 775–784
He B, Patel M, Zhang Z, Chang KCC (2007) Accessing the deep web. Commun ACM 50(5):94–101
Ibrahim Y, Riedewald M, Weikum G (2016) Making sense of entities and quantities in web tables. In: CIKM, pp 1703–1712
Jou C (2015) Semantics-assisted deep web query interface classification. In: Computer science & software engineering, pp 70–78
Kayed M, Chang CH (2010) FiVaTech: page-level web data extraction from template pages. IEEE Trans Knowl Data Eng 22(2):249–263
Lu Y, He H, Zhao H et al. (2013) Annotating search results from web databases. IEEE Trans Knowl Data Eng 25(3):514–527
Sarawagi S, Chakrabarti S (2014) Open-domain quantity queries on web tables: annotation, response, and consensus models. In: SIGKDD, pp 711–720
Sequeda JF, Arenas M, Miranker DP (2012) On directly mapping relational databases to RDF and OWL. In: WWW, pp 649–658
Sleiman HA, Corchuelo R (2013) TEX: an efficient and effective unsupervised web information extractor. Knowl-Based Syst 39:109–123
Su W, Wang J, Lochovsky FH, Liu Y (2012) Combining tag and value similarity for data extraction and alignment. IEEE Trans Knowl Data Eng 24(7):1186–1200
Vieira K, da Costa Carvalho AL, Berlt K et al. (2009) On finding templates on web collections. WWW J 12(2):171–211
Weninger T, Hsu WH, Han J (2010) CETR: content extraction via tag ratios. In: WWW, pp 971–980
Wu S, Liu J, Fan J (2015) Automatic web content extraction by combination of learning and grouping. In: WWW. ACM, pp 1264–1274
Yuliana OY, Chang CH (2016) AFIS: aligning detail-pages for full schema induction. In: TAAI, pp 220–227
Zhai Y, Liu B (2006) Structured data extraction from the web based on partial tree alignment. IEEE Trans Knowl Data Eng 18(12):1614–1628
Zheng X, Gu Y, Li Y (2012) Data extraction from web pages based on structural-semantic entropy. In: WWW, pp 93–102
Acknowledgements
This research is supported by the Ministry of Science and Technology, Taiwan, under grant MOST 105-2628-E-008-004-MY2.
Additional information
Extended paper of TAAI 2016 [28]
About this article
Cite this article
Yuliana, O.Y., Chang, CH. A novel alignment algorithm for effective web data extraction from singleton-item pages. Appl Intell 48, 4355–4370 (2018). https://doi.org/10.1007/s10489-018-1208-0