Improving recognition accuracy on structured documents by learning structural patterns

  • Original Article
Pattern Analysis and Applications

Abstract

In this paper, we present a probabilistic method that can improve the efficiency of document classification when applied to structured documents. The analysis of the structure of a document is the starting point of document classification. Our method is designed to augment other classification schemes and to complement pre-filtering information extraction procedures in order to reduce uncertainties. To this end, a probability distribution on the structure of XML documents is introduced. We show how to parameterise existing learning methods to describe the structure distribution efficiently. The learned distribution is then used to predict the classes of unseen documents. Novelty detection that makes use of the structure-based distribution function is also discussed. Demonstrations on model documents and on Internet XML documents are presented.
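The idea of learning a per-class distribution over document structure, then using it both for class prediction and for novelty detection, can be illustrated with a small sketch. This is not the authors' method: the abstract does not specify the model, so the sketch swaps in a naive independent-Bernoulli model over root-to-node tag paths, with Laplace smoothing. The names `tag_paths`, `StructureModel`, the toy documents and the threshold value are all hypothetical and chosen only for illustration.

```python
import math
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths found in an XML string."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


class StructureModel:
    """Hypothetical stand-in: one independent Bernoulli per tag path and class."""

    def fit(self, docs_by_class):
        feats = {c: [tag_paths(d) for d in docs] for c, docs in docs_by_class.items()}
        self.vocab = sorted(set().union(*(f for fs in feats.values() for f in fs)))
        self.n = {c: len(fs) for c, fs in feats.items()}
        total = sum(self.n.values())
        self.prior = {c: self.n[c] / total for c in feats}
        # Laplace-smoothed probability that a path occurs in a document of class c.
        self.theta = {
            c: {p: (sum(p in f for f in fs) + 1) / (len(fs) + 2) for p in self.vocab}
            for c, fs in feats.items()
        }
        return self

    def log_likelihood(self, xml_text, c):
        """Log-probability of the document's structure under class c."""
        present = tag_paths(xml_text)
        ll = 0.0
        for p in self.vocab:
            t = self.theta[c][p]
            ll += math.log(t if p in present else 1.0 - t)
        # Paths never seen in training get the smoothed zero-count probability.
        unseen = len(present - set(self.vocab))
        ll += unseen * math.log(1.0 / (self.n[c] + 2))
        return ll

    def predict(self, xml_text):
        """Pick the class with the highest posterior score."""
        return max(
            self.prior,
            key=lambda c: math.log(self.prior[c]) + self.log_likelihood(xml_text, c),
        )

    def is_novel(self, xml_text, threshold):
        """Flag a document whose structure is unlikely under every class."""
        return max(self.log_likelihood(xml_text, c) for c in self.prior) < threshold


# Toy training data (invented for this sketch).
docs = {
    "catalog": [
        "<catalog><book><title/></book></catalog>",
        "<catalog><book><title/><author/></book></catalog>",
    ],
    "molecule": [
        "<molecule><atomArray><atom/></atomArray></molecule>",
        "<molecule><atomArray><atom/><atom/></atomArray><bondArray/></molecule>",
    ],
}
model = StructureModel().fit(docs)
print(model.predict("<catalog><book><author/></book></catalog>"))
print(model.is_novel("<html><body><p/></body></html>", threshold=-6.0))
```

The independence assumption keeps the sketch short but ignores dependencies between structural elements; a model that captures such dependencies would replace `theta` with a richer distribution while leaving the `predict` and `is_novel` logic unchanged. The threshold would in practice have to be calibrated on held-out documents.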

Acknowledgements

This work was supported by the Hungarian National Science Foundation (Grant OTKA 32487) and by EOARD (Grant F61775–00-WE065). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the European Office of Aerospace Research and Development, Air Force Office of Scientific Research or the Air Force Research Laboratory.

Author information

Corresponding author

Correspondence to A. Lőrincz.

About this article

Cite this article

Hévízi, G., Marcinkovics, T. & Lőrincz, A. Improving recognition accuracy on structured documents by learning structural patterns. Pattern Anal Applic 7, 66–76 (2004). https://doi.org/10.1007/s10044-004-0208-3

  • DOI: https://doi.org/10.1007/s10044-004-0208-3
