Automatically Extracting Web Data Records

Mundluru, Dheerendranath; Raghavan, Vijay V.; Wu, Zonghuan

doi:10.1007/978-3-642-15470-6_51

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6335)

Included in the following conference series:

International Conference on Active Media Technology

Abstract

Web applications such as e-commerce portals need to enrich their existing content offerings by aggregating relevant structured data (e.g., product reviews) from external Web resources. To meet this goal, we present an algorithm for automatically extracting data records from Web pages. The algorithm uses a robust string matching technique to accurately identify the records in a Web page. Our experiments on diverse datasets (including datasets from third-party research projects) show that the proposed algorithm is highly effective and performs considerably better than two other state-of-the-art automatic data extraction systems. We have made the proposed system publicly accessible so that readers can evaluate it.
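
The abstract does not describe the string matching technique itself, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that each candidate record has already been reduced to a sequence of HTML tag names, and it groups candidates whose tag sequences are close under normalized edit distance. The function names, the 0.7 threshold, and the sample tag sequences are all hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0, 1]; 1.0 means the sequences are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def group_similar(candidates: List[List[str]],
                  threshold: float = 0.7) -> List[List[int]]:
    """Greedily group candidate indices by similarity to each group's first member.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    groups: List[List[int]] = []
    for idx, tags in enumerate(candidates):
        for group in groups:
            if similarity(candidates[group[0]], tags) >= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups


# Hypothetical tag sequences for four page regions.
candidates = [
    ["div", "a", "span", "p"],
    ["div", "a", "span", "span", "p"],
    ["div", "a", "p"],
    ["ul", "li", "li", "li"],
]
print(group_similar(candidates))
```

In this sketch the first three regions differ from one another by a single tag and fall into one group, while the list-like region ends up on its own. Approximate matching of this kind tolerates optional fields inside records, which exact matching would not.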

Author information

Authors and Affiliations

IMshopping Inc., Santa Clara, USA
Dheerendranath Mundluru, Vijay V. Raghavan & Zonghuan Wu
University of Louisiana at Lafayette, Lafayette, USA
Dheerendranath Mundluru, Vijay V. Raghavan & Zonghuan Wu
Huawei Technologies Corp., Santa Clara, USA
Dheerendranath Mundluru, Vijay V. Raghavan & Zonghuan Wu

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, York University, M3J 1P3, Toronto, ON, Canada
Aijun An
Department of Mathematics and Computing Science, Saint Mary’s University, B3H 3C3, Halifax, NS, Canada
Pawan Lingras
Faculty of Fine Arts, University of Regina, 3737 Wascana Parkway, S4S 0A2, Regina, SK, Canada
Sheila Petty
Faculty of Computer and Information Sciences, Hosei University, 3-7-2, Kajino-cho, Koganei-shi, 184-8584, Tokyo, Japan
Runhe Huang

About this paper

Cite this paper

Mundluru, D., Raghavan, V.V., Wu, Z. (2010). Automatically Extracting Web Data Records. In: An, A., Lingras, P., Petty, S., Huang, R. (eds) Active Media Technology. AMT 2010. Lecture Notes in Computer Science, vol 6335. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15470-6_51

DOI: https://doi.org/10.1007/978-3-642-15470-6_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15469-0
Online ISBN: 978-3-642-15470-6
eBook Packages: Computer Science, Computer Science (R0)
