Using Clustering and Edit Distance Techniques for Automatic Web Data Extraction

Álvarez, Manuel; Pan, Alberto; Raposo, Juan; Bellas, Fernando; Cacheda, Fidel

doi:10.1007/978-3-540-76993-4_18

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4831)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. Several previous works have addressed the same problem, but most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make assumptions about how data records are encoded into web pages that do not always hold in real websites. Finally, we have tested our techniques with a large number of real web sources and found them to be very effective.
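
To make the two ingredients named in the title concrete, here is a minimal sketch. It is not the authors' algorithm, which the abstract does not specify. It assumes each candidate data record has already been reduced to a sequence of HTML tag names, computes a unit-cost Levenshtein edit distance between sequences, and groups records with a simple greedy pass. The function names, the 0.3 threshold, and the sample records are all hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x -> y
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance divided by the longer length, so it lies in [0, 1]."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def cluster(records: List[Sequence[str]], threshold: float = 0.3) -> List[List[int]]:
    """Greedy one-pass grouping (an assumed, simplistic scheme).

    Each record joins the first cluster whose first member is within
    `threshold`; otherwise it starts a new cluster. Returns record indices.
    """
    clusters: List[List[int]] = []
    for idx, rec in enumerate(records):
        for members in clusters:
            if normalized_distance(records[members[0]], rec) <= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Hypothetical tag sequences: three table-row records and one unrelated block.
records = [
    ["tr", "td", "a", "td", "span"],
    ["tr", "td", "a", "td", "span", "img"],
    ["div", "h2", "p"],
    ["tr", "td", "a", "td"],
]
groups = cluster(records)
print(groups)
```

In this toy input the three row-like records fall into one group even though they differ by an optional tag, which is the kind of template regularity the abstract refers to. A real system would need a more careful similarity measure and clustering method than this greedy pass.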

This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.

Author information

Authors and Affiliations

Department of Information and Communications Technologies, University of A Coruña, Campus de Elviña s/n. 15071. A Coruña, Spain
Manuel Álvarez, Alberto Pan, Juan Raposo, Fernando Bellas & Fidel Cacheda

Editor information

Boualem Benatallah, Fabio Casati, Dimitrios Georgakopoulos, Claudio Bartolini, Wasim Sadiq, Claude Godart

Cite this paper

Álvarez, M., Pan, A., Raposo, J., Bellas, F., Cacheda, F. (2007). Using Clustering and Edit Distance Techniques for Automatic Web Data Extraction. In: Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini, C., Sadiq, W., Godart, C. (eds) Web Information Systems Engineering – WISE 2007. WISE 2007. Lecture Notes in Computer Science, vol 4831. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76993-4_18

DOI: https://doi.org/10.1007/978-3-540-76993-4_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76992-7
Online ISBN: 978-3-540-76993-4
eBook Packages: Computer Science, Computer Science (R0)
