Symbolic Learning Techniques in Paper Document Processing

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1715)

Abstract

WISDOM++ is an intelligent document processing system that transforms a paper document into HTML/XML format. The main design requirement is adaptivity, which is realized through the application of machine learning methods. This paper illustrates the application of symbolic learning algorithms to the first three steps of document processing, namely document analysis, document classification, and document understanding. The machine learning issues raised by this application are the efficient incremental induction of decision trees from numeric data, the handling of both numeric and symbolic data in first-order rule learning, and the learning of mutually dependent concepts. Experimental results obtained on a set of real-world documents are presented and discussed.

Acknowledgments

The authors would like to thank Francesco De Tommaso, Dario Gerbino, Ignazio Sardella, Giacomo Sidella, Rosa Maria Spadavecchia, and Silvana Spagnoletta for their contribution to the development of WISDOM++. Thanks also to the authors of the systems C4.5 and ITI.

Copyright information

© 1999 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Altamura, O., Esposito, F., Lisi, F.A., Malerba, D. (1999). Symbolic Learning Techniques in Paper Document Processing. In: Perner, P., Petrou, M. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 1999. Lecture Notes in Computer Science, vol 1715. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48097-8_13

  • DOI: https://doi.org/10.1007/3-540-48097-8_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-66599-1

  • Online ISBN: 978-3-540-48097-6

  • eBook Packages: Springer Book Archive
