Page Similarity and Classification

Marinai, Simone

doi:10.1007/978-0-85729-859-1_7

Reference work entry
First Online: 01 January 2019

Abstract

Document analysis and recognition techniques address many types of documents, ranging from small items such as forms to larger ones such as maps. In most cases, humans can discern the type of a document, and therefore its function, without reading the actual textual content. This is possible because the layout of a document often reflects its type. For instance, invoices are more visually similar to one another than they are to technical papers. Two related tasks, page classification and page retrieval, are based on the analysis of the visual similarity between documents and are addressed in this chapter. They are presented in a unified perspective because they share several technical features and are sometimes used together in the same applications.
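
As a rough illustration of how one layout-based similarity measure can serve both tasks, the Python sketch below describes each page by how much of every cell in a coarse grid is covered by layout blocks, then ranks pages by distance (retrieval) and assigns the label of the closest page (classification). This is not the method of the chapter: the grid-occupancy feature, the Euclidean distance, the nearest-neighbour rule, the function names, and the toy pages are all assumptions made for illustration only.

```python
from math import sqrt

# A layout block is (x0, y0, x1, y1) with coordinates normalized to [0, 1].
# This representation is an assumption of the sketch.
Block = tuple[float, float, float, float]


def layout_features(blocks: list[Block], grid: int = 4) -> list[float]:
    """Return, for each cell of a grid x grid partition of the page,
    the fraction of the cell covered by blocks (capped at 1.0)."""
    cell = 1.0 / grid
    feats = []
    for row in range(grid):
        for col in range(grid):
            cx0, cy0 = col * cell, row * cell
            cx1, cy1 = cx0 + cell, cy0 + cell
            covered = 0.0
            for x0, y0, x1, y1 in blocks:
                w = max(0.0, min(x1, cx1) - max(x0, cx0))
                h = max(0.0, min(y1, cy1) - max(y0, cy0))
                covered += w * h
            feats.append(min(1.0, covered / (cell * cell)))
    return feats


def distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query: list[Block],
             collection: dict[str, tuple[str, list[Block]]]) -> list[str]:
    """Page retrieval: page names ranked by increasing layout distance."""
    q = layout_features(query)
    return sorted(
        collection,
        key=lambda name: distance(q, layout_features(collection[name][1])),
    )


def classify(query: list[Block],
             collection: dict[str, tuple[str, list[Block]]]) -> str:
    """Page classification: label of the most similar labelled page."""
    return collection[retrieve(query, collection)[0]][0]


# Toy labelled pages (hypothetical layouts, not real data).
collection = {
    "invoice_a": ("invoice", [(0.05, 0.05, 0.45, 0.15), (0.05, 0.40, 0.95, 0.70)]),
    "invoice_b": ("invoice", [(0.05, 0.05, 0.50, 0.20), (0.05, 0.45, 0.95, 0.75)]),
    "paper_a": ("paper", [(0.10, 0.05, 0.90, 0.12),
                          (0.05, 0.20, 0.48, 0.95),
                          (0.52, 0.20, 0.95, 0.95)]),
}
query = [(0.05, 0.05, 0.40, 0.15), (0.05, 0.40, 0.95, 0.72)]

ranking = retrieve(query, collection)
label = classify(query, collection)
```

Here the same ranking drives both tasks: retrieval returns the ordered list, while classification reads off the label of its first element. A practical system would replace the hand-written blocks with the output of a layout analysis step and would likely use a richer page representation.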

Author information

Authors and Affiliations

Dipartimento di Ingegneria dell’Informazione, Università degli studi di Firenze, Firenze, Italy
Simone Marinai

Corresponding author

Correspondence to Simone Marinai.

Editor information

Editors and Affiliations

University of Maryland, College Park, MD, USA
David Doermann
Université de Lorraine, Nancy, France
Karl Tombre

About this entry

Cite this entry

Marinai, S. (2014). Page Similarity and Classification. In: Doermann, D., Tombre, K. (eds) Handbook of Document Image Processing and Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-859-1_7

DOI: https://doi.org/10.1007/978-0-85729-859-1_7
Published: 24 July 2019
Publisher Name: Springer, London
Print ISBN: 978-0-85729-858-4
Online ISBN: 978-0-85729-859-1
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
