Extracting Data Records from Query Result Pages Based on Visual Features

Weng, Daiyue; Hong, Jun; Bell, David A.

doi:10.1007/978-3-642-24577-0_16

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7051)

Included in the following conference series:

British National Conference on Databases

Abstract

Web databases contain large amounts of structured data that are accessible only through their query interfaces. Query results are presented in dynamically generated web pages, usually in the form of data records, for human use. Automatically extracting data records from query result pages is critical for web data integration applications such as comparison shopping sites and meta-search engines. A number of approaches to query result extraction have been proposed, but they start to fail as the structures of web pages become more complex. Query result pages also usually contain other types of information in addition to query results, e.g., advertisements and navigation bars. Most existing approaches do not remove such irrelevant content, which may affect the accuracy of data record extraction. We have observed that query results are usually displayed in regular visual patterns and that terms used in a query often reappear in the query results. We propose a novel approach that makes use of visual features and query terms to identify the data section and extract data records from it. We also use several content and visual features of the visual blocks in a data section to filter out noisy blocks. The results of our experiments on a large set of query result pages in different domains show that our proposed approach is highly effective.
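The abstract gives no algorithmic detail, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a page has already been rendered and split into candidate sections of rectangular blocks. The `Block` type, the scoring functions, the equal weighting of the two scores and the pixel tolerance are all hypothetical choices made for this example. A section is scored by how often the query terms reappear in its blocks and by how regularly its blocks are aligned. Blocks that break the alignment are then treated as noise, which simplifies the content and visual features mentioned above to a single visual check.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Block:
    """A rendered rectangle on the page (hypothetical representation)."""
    x: int
    y: int
    width: int
    height: int
    text: str


def is_aligned(block, mid_x, mid_width, tolerance):
    """True if the block's left edge and width are close to the given medians."""
    return (abs(block.x - mid_x) <= tolerance
            and abs(block.width - mid_width) <= tolerance)


def term_hit_ratio(blocks, query_terms):
    """Fraction of blocks whose text contains at least one query term."""
    if not blocks:
        return 0.0
    terms = [t.lower() for t in query_terms]
    hits = sum(any(t in b.text.lower() for t in terms) for b in blocks)
    return hits / len(blocks)


def alignment_ratio(blocks, tolerance=5):
    """Fraction of blocks matching the section's median left edge and width."""
    if not blocks:
        return 0.0
    mid_x = median(b.x for b in blocks)
    mid_width = median(b.width for b in blocks)
    aligned = sum(is_aligned(b, mid_x, mid_width, tolerance) for b in blocks)
    return aligned / len(blocks)


def pick_data_section(sections, query_terms):
    """Return the candidate section with the highest combined score (assumed equal weights)."""
    return max(sections,
               key=lambda s: term_hit_ratio(s, query_terms) + alignment_ratio(s))


def drop_noisy_blocks(blocks, tolerance=5):
    """Keep only the blocks aligned with the section's median left edge and width."""
    mid_x = median(b.x for b in blocks)
    mid_width = median(b.width for b in blocks)
    return [b for b in blocks if is_aligned(b, mid_x, mid_width, tolerance)]


# Made-up example page: a navigation bar and a result list containing one advertisement.
navigation = [
    Block(0, 0, 120, 20, "Home"),
    Block(0, 25, 120, 20, "Contact"),
]
results = [
    Block(200, 100, 600, 80, "Canon camera EOS 600D"),
    Block(200, 190, 600, 80, "Nikon camera D3100"),
    Block(200, 280, 600, 80, "Sony camera A55"),
    Block(250, 370, 300, 40, "Sponsored: cheap flights"),
]

data_section = pick_data_section([navigation, results], ["camera"])
records = drop_noisy_blocks(data_section)
```

In this made-up example the result list is chosen over the navigation bar because the query term reappears in most of its blocks, and the advertisement is dropped because its position and width differ from those of the other blocks.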

Author information

Authors and Affiliations

School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, BT7 1NN, UK
Daiyue Weng, Jun Hong & David A. Bell

Editor information

Editors and Affiliations

University of Manchester, M13 9PL, Manchester, UK
Alvaro A. A. Fernandes, Alasdair J. G. Gray & Khalid Belhajjame

About this paper

Cite this paper

Weng, D., Hong, J., Bell, D.A. (2011). Extracting Data Records from Query Result Pages Based on Visual Features. In: Fernandes, A.A.A., Gray, A.J.G., Belhajjame, K. (eds) Advances in Databases. BNCOD 2011. Lecture Notes in Computer Science, vol 7051. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24577-0_16

DOI: https://doi.org/10.1007/978-3-642-24577-0_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24576-3
Online ISBN: 978-3-642-24577-0
eBook Packages: Computer Science, Computer Science (R0)
