Abstract
It is essential for Web applications such as e-commerce portals to enrich their existing content offerings by aggregating relevant structured data (e.g., product reviews) from external Web resources. To meet this goal, in this paper, we present an algorithm for automatically extracting data records from Web pages. The algorithm uses a robust string matching technique for accurately identifying the records in the Webpage. Our experiments on diverse datasets (including datasets from third-party research projects) show that the proposed algorithm is highly effective and performs considerably better than two other state-of-the-art automatic data extraction systems. We made the proposed system publicly accessible in order for the readers to evaluate it.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Mundluru, D.: Automatically Constructing Wrappers for Effective and Efficient Web Information Extraction. PhD thesis, University of Louisiana at Lafayette (2008)
Muslea, I., Minton, S., Knoblock, C.: A Hierarchical Approach to Wrapper. In: Proceedings of the 3rd International Conference on Autonomous Agents, Seattle, pp. 190–197 (1999)
Mundluru, D., Xia, S.: Experiences in Crawling Deep Web in the Context of Local Search. In: Proceedings of the 5th Workshop on Geographical Information Retrieval, Napa Valley (2008)
Liu, B., Grossman, R., Zhai, Y.: Mining Data Records in Web Pages. In: Proceedings of the ACM International Conference on Knowledge Discovery & Data Mining, Washington, D.C, pp. 601–606 (2003)
Zhao, H., Meng, W., Wu, Z., Raghavan, V., Yu, C.: Fully Automatic Wrapper Generation for Search Engines. In: Proceedings of the 14th International World Wide Web Conference, Chiba, pp. 66–75 (2005), http://www.data.binghamton.edu:8080/vints/
Hall, P., Dowling, G.: Approximate String Matching. ACM Computing Surveys 12(4), 381–402 (1980)
Buttler, D., Liu, L., Pu, C.: A Fully Automated Extraction System for the World Wide Web. In: Proceedings of the International Conference on Distributed Computing Systems, Phoenix, pp. 361–370 (2001)
ISE. A Repository of Online Information Sources Used in Information Extraction Tasks, University of Southern California (1998), www.isi.edu/info-agents/RISE/index.html
Crescenzi, V., Mecca, G., Merialdo, P.: RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In: Proceedings of the 27th International Conference on Very Large Data Bases, Rome, pp. 109–118 (2001)
PIE Demo System, http://www.fatneuron.com/pie/
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mundluru, D., Raghavan, V.V., Wu, Z. (2010). Automatically Extracting Web Data Records. In: An, A., Lingras, P., Petty, S., Huang, R. (eds) Active Media Technology. AMT 2010. Lecture Notes in Computer Science, vol 6335. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15470-6_51
Download citation
DOI: https://doi.org/10.1007/978-3-642-15470-6_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15469-0
Online ISBN: 978-3-642-15470-6
eBook Packages: Computer ScienceComputer Science (R0)