Abstract
This paper describes features and methods for comparing and classifying document images at the spatial layout level. The methods support document retrieval based on visual similarity, and they provide fast algorithms for initial document type classification without OCR. A novel feature set called interval encoding is introduced to capture elements of spatial layout. It encodes region layout information in fixed-length vectors that capture structural characteristics of the image. These vectors are then compared with one another using the Manhattan distance, which gives a fast page layout comparison. The paper describes experiments and results on rank-ordering a set of document pages by their layout similarity to a test document. We also demonstrate the usefulness of features derived from interval encoding in a hidden Markov model based page layout classification system that is trainable and extensible. The methods described can be used in various document retrieval tasks, including visual similarity based retrieval, categorization, and information extraction.
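The ranking step described above can be illustrated with a minimal sketch. It assumes each page has already been reduced to a fixed-length feature vector. The interval encoding itself is not specified in the abstract, so the vectors, page names, and the helper names `manhattan` and `rank_by_layout` below are hypothetical placeholders rather than the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """Manhattan (L1) distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_layout(
    query: Sequence[float], pages: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (page name, distance) pairs, most similar layout first."""
    scored = [(name, manhattan(query, vec)) for name, vec in pages.items()]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical fixed-length layout vectors (illustrative values only,
# not produced by the paper's interval encoding).
pages = {
    "letter": [4, 0, 2, 6],
    "memo": [1, 1, 2, 5],
    "article": [0, 7, 7, 0],
}
query = [1, 0, 2, 6]

ranking = rank_by_layout(query, pages)
for name, dist in ranking:
    print(name, dist)
```

Because every vector has the same length, each comparison costs time linear in the vector length, which is consistent with the abstract's emphasis on fast page layout comparison.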
Cite this article
Hu, J., Kashi, R. & Wilfong, G. Comparison and Classification of Documents Based on Layout Similarity. Information Retrieval 2, 227–243 (2000). https://doi.org/10.1023/A:1009910911387
DOI: https://doi.org/10.1023/A:1009910911387