Improving recognition accuracy on structured documents by learning structural patterns

Hévízi, Gy.; Marcinkovics, T.; Lőrincz, A.

doi:10.1007/s10044-004-0208-3

Original Article
Published: 30 March 2004

Volume 7, pages 66–76, (2004)

Pattern Analysis and Applications

Gy. Hévízi¹,
T. Marcinkovics¹ &
A. Lőrincz¹

Abstract

In this paper, we present a probabilistic method that can improve the efficiency of document classification when applied to structured documents. The analysis of the structure of a document is the starting point of document classification. Our method is designed to augment other classification schemes and to complement pre-filtering information extraction procedures in order to reduce uncertainties. To this end, a probability distribution over the structure of XML documents is introduced. We show how to parameterise existing learning methods to describe the structure distribution efficiently. The learned distribution is then used to predict the classes of unseen documents. Novelty detection that makes use of the structure-based distribution function is also discussed. Demonstrations on model documents and on Internet XML documents are presented.
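To make the idea of a learned structure distribution concrete, the sketch below shows one simple way such a pipeline could look. It is an illustration under stated assumptions, not the method of the article: the abstract does not specify the model, so an independent-feature (Bernoulli naive Bayes) model over root-to-element tag paths is used as a stand-in. The names `tag_paths`, `StructureModel`, `unseen_prob` and the novelty `threshold` are all hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import defaultdict


def tag_paths(xml_text):
    """Return the set of root-to-element tag paths, e.g. {'a', 'a/b'}."""
    root = ET.fromstring(xml_text)
    found = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        found.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return found


class StructureModel:
    """Per-class Bernoulli model of tag-path presence (an assumed stand-in)."""

    def __init__(self, smoothing=1.0, unseen_prob=1e-3):
        self.smoothing = smoothing      # Laplace smoothing constant
        self.unseen_prob = unseen_prob  # assumed probability of a path never seen in training
        self.vocab = set()
        self.presence = {}              # label -> {path: P(path present | label)}
        self.log_prior = {}

    def fit(self, docs, labels):
        feats = [tag_paths(d) for d in docs]
        self.vocab = set().union(*feats)
        by_label = defaultdict(list)
        for f, y in zip(feats, labels):
            by_label[y].append(f)
        for y, group in by_label.items():
            n = len(group)
            self.log_prior[y] = math.log(n / len(docs))
            self.presence[y] = {
                p: (sum(p in f for f in group) + self.smoothing)
                / (n + 2 * self.smoothing)
                for p in self.vocab
            }
        return self

    def log_likelihood(self, xml_text, label):
        f = tag_paths(xml_text)
        probs = self.presence[label]
        total = 0.0
        for p in self.vocab:
            total += math.log(probs[p] if p in f else 1.0 - probs[p])
        # Paths outside the training vocabulary lower the likelihood.
        total += len(f - self.vocab) * math.log(self.unseen_prob)
        return total

    def predict(self, xml_text):
        scores = {
            y: self.log_prior[y] + self.log_likelihood(xml_text, y)
            for y in self.presence
        }
        return max(scores, key=scores.get)

    def is_novel(self, xml_text, threshold):
        """Flag a document whose best class likelihood falls below threshold."""
        best = max(self.log_likelihood(xml_text, y) for y in self.presence)
        return best < threshold


train_docs = [
    "<book><title/><author/></book>",
    "<book><title/></book>",
    "<mol><atom/><bond/></mol>",
    "<mol><atom/></mol>",
]
train_labels = ["book", "book", "mol", "mol"]
model = StructureModel().fit(train_docs, train_labels)
```

With this toy model, `model.predict(...)` assigns an unseen document to the class under which its structure is most probable, and `model.is_novel(...)` flags documents whose structure is improbable under every class. The independence assumption between paths is a deliberate simplification; a model that captures dependencies between structural elements would replace `log_likelihood` while leaving the rest of the pipeline unchanged.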

Acknowledgements

This work was supported by the Hungarian National Science Foundation (Grant OTKA 32487) and by EOARD (Grant F61775–00-WE065). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the European Office of Aerospace Research and Development, Air Force Office of Scientific Research or the Air Force Research Laboratory.

Author information

Authors and Affiliations

Department of Information Systems, Eötvös University, Pázmány Péter sétány 1/C, 1117, Budapest, Hungary
Gy. Hévízi, T. Marcinkovics & A. Lőrincz

Corresponding author

Correspondence to A. Lőrincz.

About this article

Cite this article

Hévízi, G., Marcinkovics, T. & Lőrincz, A. Improving recognition accuracy on structured documents by learning structural patterns. Pattern Anal Applic 7, 66–76 (2004). https://doi.org/10.1007/s10044-004-0208-3

Received: 23 October 2002
Accepted: 10 February 2004
Published: 30 March 2004
Issue Date: April 2004
DOI: https://doi.org/10.1007/s10044-004-0208-3
