A survey of document image classification: problem statement, classifier architecture and performance evaluation

Chen, Nawei; Blostein, Dorothea

doi:10.1007/s10032-006-0020-2

ORIGINAL PAPER
Published: 03 August 2006

Volume 10, pages 1–16, (2007)

International Journal of Document Analysis and Recognition (IJDAR)

Nawei Chen¹ &
Dorothea Blostein¹

Abstract

Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.

Author information

Authors and Affiliations

School of Computing, Queen’s University, Kingston, ON K7L 3N6, Canada
Nawei Chen & Dorothea Blostein

Corresponding author

Correspondence to Nawei Chen.

About this article

Cite this article

Chen, N., Blostein, D. A survey of document image classification: problem statement, classifier architecture and performance evaluation. IJDAR 10, 1–16 (2007). https://doi.org/10.1007/s10032-006-0020-2

Received: 01 June 2004
Accepted: 20 December 2005
Published: 03 August 2006
Issue Date: June 2007
DOI: https://doi.org/10.1007/s10032-006-0020-2