Automatically Extracting Web Data Records

  • Conference paper
Active Media Technology (AMT 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6335)

Abstract

It is essential for Web applications such as e-commerce portals to enrich their existing content offerings by aggregating relevant structured data (e.g., product reviews) from external Web resources. To meet this goal, we present an algorithm for automatically extracting data records from Web pages. The algorithm uses a robust string matching technique to accurately identify the records in a Web page. Our experiments on diverse datasets (including datasets from third-party research projects) show that the proposed algorithm is highly effective and performs considerably better than two other state-of-the-art automatic data extraction systems. We have made the proposed system publicly accessible so that readers can evaluate it.
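
The abstract does not say which string matching technique the algorithm uses, so the following is only an illustrative sketch, not the authors' method. It assumes that each candidate region of a page has already been reduced to a sequence of HTML tag names, and it uses plain Levenshtein edit distance to flag regions that closely resemble a neighbouring region, which is one common way to spot repeated data records. The function names, the sample tag sequences, and the 0.7 threshold are all hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def candidate_records(blocks: List[List[str]], threshold: float = 0.7) -> List[int]:
    """Indices of blocks that resemble at least one adjacent block.

    The threshold is an arbitrary illustrative value, not one
    taken from the paper.
    """
    hits = []
    for i, block in enumerate(blocks):
        neighbours = [blocks[k] for k in (i - 1, i + 1) if 0 <= k < len(blocks)]
        if any(similarity(block, n) >= threshold for n in neighbours):
            hits.append(i)
    return hits


# Hypothetical page: a navigation list, three product-like blocks, a footer.
blocks = [
    ["ul", "li", "a", "li", "a"],
    ["div", "img", "h3", "a", "span", "p"],
    ["div", "img", "h3", "a", "span", "span", "p"],
    ["div", "h3", "a", "span", "p"],
    ["p", "a"],
]
print(candidate_records(blocks))  # → [1, 2, 3]
```

Approximate rather than exact matching is what lets the second and third product blocks be grouped with the first even though one has an extra `span` and the other lacks an `img`.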

References

  1. Mundluru, D.: Automatically Constructing Wrappers for Effective and Efficient Web Information Extraction. PhD thesis, University of Louisiana at Lafayette (2008)

  2. Muslea, I., Minton, S., Knoblock, C.: A Hierarchical Approach to Wrapper Induction. In: Proceedings of the 3rd International Conference on Autonomous Agents, Seattle, pp. 190–197 (1999)

  3. Mundluru, D., Xia, S.: Experiences in Crawling Deep Web in the Context of Local Search. In: Proceedings of the 5th Workshop on Geographical Information Retrieval, Napa Valley (2008)

  4. Liu, B., Grossman, R., Zhai, Y.: Mining Data Records in Web Pages. In: Proceedings of the ACM International Conference on Knowledge Discovery & Data Mining, Washington, D.C., pp. 601–606 (2003)

  5. Zhao, H., Meng, W., Wu, Z., Raghavan, V., Yu, C.: Fully Automatic Wrapper Generation for Search Engines. In: Proceedings of the 14th International World Wide Web Conference, Chiba, pp. 66–75 (2005), http://www.data.binghamton.edu:8080/vints/

  6. Hall, P., Dowling, G.: Approximate String Matching. ACM Computing Surveys 12(4), 381–402 (1980)

  7. Buttler, D., Liu, L., Pu, C.: A Fully Automated Extraction System for the World Wide Web. In: Proceedings of the International Conference on Distributed Computing Systems, Phoenix, pp. 361–370 (2001)

  8. RISE: A Repository of Online Information Sources Used in Information Extraction Tasks, University of Southern California (1998), www.isi.edu/info-agents/RISE/index.html

  9. Crescenzi, V., Mecca, G., Merialdo, P.: RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In: Proceedings of the 27th International Conference on Very Large Data Bases, Rome, pp. 109–118 (2001)

  10. PIE Demo System, http://www.fatneuron.com/pie/

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mundluru, D., Raghavan, V.V., Wu, Z. (2010). Automatically Extracting Web Data Records. In: An, A., Lingras, P., Petty, S., Huang, R. (eds) Active Media Technology. AMT 2010. Lecture Notes in Computer Science, vol 6335. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15470-6_51

  • DOI: https://doi.org/10.1007/978-3-642-15470-6_51

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15469-0

  • Online ISBN: 978-3-642-15470-6

  • eBook Packages: Computer Science, Computer Science (R0)
