A survey of document image classification: problem statement, classifier architecture and performance evaluation

  • ORIGINAL PAPER
International Journal of Document Analysis and Recognition (IJDAR)

Abstract

Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging because of the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises from ill-defined or fuzzy document classes.

Author information

Corresponding author

Correspondence to Nawei Chen.

About this article

Cite this article

Chen, N., Blostein, D. A survey of document image classification: problem statement, classifier architecture and performance evaluation. IJDAR 10, 1–16 (2007). https://doi.org/10.1007/s10032-006-0020-2

  • DOI: https://doi.org/10.1007/s10032-006-0020-2
