Abstract
Document image classification is an important step in office automation, digital libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging because of the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises from ill-defined or fuzzy document classes.
Cite this article
Chen, N., Blostein, D. A survey of document image classification: problem statement, classifier architecture and performance evaluation. IJDAR 10, 1–16 (2007). https://doi.org/10.1007/s10032-006-0020-2
DOI: https://doi.org/10.1007/s10032-006-0020-2