Symbolic Learning Techniques in Paper Document Processing

Altamura, Oronzo; Esposito, Floriana; Lisi, Francesca A.; Malerba, Donato

doi:10.1007/3-540-48097-8_13

Conference paper
First Online: 01 January 2000

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1715)

Abstract

WISDOM++ is an intelligent document processing system that transforms a paper document into HTML/XML format. The main design requirement is adaptivity, which is realized through the application of machine learning methods. This paper illustrates the application of symbolic learning algorithms to the first three steps of document processing, namely document analysis, document classification, and document understanding. The application raises three machine learning issues: efficient incremental induction of decision trees from numeric data, handling of both numeric and symbolic data in first-order rule learning, and learning of mutually dependent concepts. Experimental results obtained on a set of real-world documents are presented and discussed.
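
As a rough illustration of what inducing a decision tree test from numeric data involves, the sketch below picks a threshold on one numeric attribute by information gain. It is a generic, batch (non-incremental) textbook procedure, not the algorithm used in WISDOM++, and the attribute (block height in pixels), the class names, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best test `value <= threshold`.

    Candidate thresholds are midpoints between consecutive distinct sorted values.
    Returns (None, 0.0) if no split improves on the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Weighted entropy of the two partitions produced by the split.
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = ((lo + hi) / 2, gain)
    return best


# Hypothetical data: heights of layout blocks and a made-up block type.
heights = [8, 10, 12, 40, 55, 60]
kinds = ["text", "text", "text", "image", "image", "image"]
threshold, gain = best_threshold(heights, kinds)
```

An incremental learner would have to revise such a threshold as new examples arrive rather than recompute it from scratch, which is where the efficiency issue named in the abstract comes in.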

Acknowledgments

The authors would like to thank Francesco De Tommaso, Dario Gerbino, Ignazio Sardella, Giacomo Sidella, Rosa Maria Spadavecchia, and Silvana Spagnoletta for their contribution to the development of WISDOM++. Thanks also to the authors of the systems C4.5 and ITI.

Author information

Authors and Affiliations

Dipartimento di Informatica, Università degli Studi di Bari, via Orabona 4, 70126, Bari, Italy
Oronzo Altamura, Floriana Esposito, Francesca A. Lisi & Donato Malerba

Editor information

Editors and Affiliations

Institut für Bildverarbeitung und angewandte Informatik, Arno-Nitzsche-Str. 45, D-04277, Leipzig, Germany
Petra Perner
School of Electronic Engineering, Information Technology and Mathematics, University of Surrey, Guildford, GU2 5XH, UK
Maria Petrou

About this paper

Cite this paper

Altamura, O., Esposito, F., Lisi, F.A., Malerba, D. (1999). Symbolic Learning Techniques in Paper Document Processing. In: Perner, P., Petrou, M. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 1999. Lecture Notes in Computer Science, vol 1715. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48097-8_13

DOI: https://doi.org/10.1007/3-540-48097-8_13
Published: 24 March 2000
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66599-1
Online ISBN: 978-3-540-48097-6
eBook Packages: Springer Book Archive
