Web data extraction based on structural similarity

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Web data-extraction systems in use today focus mainly on the generation of extraction rules, i.e., wrapper induction. As a result, they appear ad hoc and are difficult to integrate when a holistic view is taken. Each phase of the data-extraction process is disconnected from the others, and the phases share no common foundation that would make building a complete system straightforward. In this paper, we demonstrate a holistic approach to Web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g., training set preparation, wrapper induction and document classification) can be easily integrated. This improves efficiency and gives better control over the extraction procedure, as our experimental results confirm. More importantly, because a document can be represented as a vector of schemata, it can be easily incorporated into existing systems as the fabric for integration.
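To make the idea of representing a document as a vector over structural patterns more concrete, the following is a minimal sketch and not the method of the paper. It assumes, purely for illustration, that a "schema" can be approximated by a root-to-element tag path; the names `PathCollector`, `path_counts`, `to_vectors` and `cosine` are hypothetical and do not come from the article.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# HTML elements that never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Counts root-to-element tag paths (an assumed stand-in for document schemata)."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def path_counts(html):
    """Return a Counter mapping each tag path in the document to its frequency."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def to_vectors(docs):
    """Build a shared path vocabulary and one count vector per document."""
    counts = [path_counts(doc) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    return vocabulary, [[c[path] for path in vocabulary] for c in counts]


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = [
    "<html><body><ul><li>a</li><li>b</li></ul></body></html>",
    "<html><body><ul><li>c</li></ul></body></html>",
    "<html><body><table><tr><td>x</td></tr></table></body></html>",
]
vocabulary, vectors = to_vectors(docs)
# The two list pages score as more similar to each other than to the table page.
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))
```

Once documents are fixed-length vectors of this kind, standard clustering or classification tools can consume them directly, which is one plausible reading of how such a representation could link the phases of an extraction pipeline.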

Corresponding author

Correspondence to Zhao Li.

About this article

Cite this article

Li, Z., Ng, W. & Sun, A. Web data extraction based on structural similarity. Knowl Inf Syst 8, 438–461 (2005). https://doi.org/10.1007/s10115-004-0188-z

  • DOI: https://doi.org/10.1007/s10115-004-0188-z
