Using Clustering and Edit Distance Techniques for Automatic Web Data Extraction

  • Conference paper
Web Information Systems Engineering – WISE 2007 (WISE 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4831)

Abstract

Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. While several previous works have addressed the same problem, most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make assumptions about how data records are encoded into web pages that do not always hold in real websites. Finally, we have tested our techniques with a large number of real web sources and found them to be very effective.

This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.



Editor information

Boualem Benatallah, Fabio Casati, Dimitrios Georgakopoulos, Claudio Bartolini, Wasim Sadiq, Claude Godart

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Álvarez, M., Pan, A., Raposo, J., Bellas, F., Cacheda, F. (2007). Using Clustering and Edit Distance Techniques for Automatic Web Data Extraction. In: Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini, C., Sadiq, W., Godart, C. (eds) Web Information Systems Engineering – WISE 2007. WISE 2007. Lecture Notes in Computer Science, vol 4831. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76993-4_18

  • DOI: https://doi.org/10.1007/978-3-540-76993-4_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-76992-7

  • Online ISBN: 978-3-540-76993-4

  • eBook Packages: Computer Science, Computer Science (R0)
