Automatic document classification and indexing in high-volume applications

Appiani, E.; Cesarini, F.; Colla, A.M.; Diligenti, M.; Gori, M.; Marinai, S.; Soda, G.

doi:10.1007/PL00010904

Published: December 2001

Volume 4, pages 69–83, (2001)

International Journal on Document Analysis and Recognition

Abstract.

This paper describes a system for the analysis and automatic indexing of imaged documents in high-volume applications. The system, named STRETCH (STorage and RETrieval by Content of imaged documents), is based on an Archiving and Retrieval Engine, which overcomes the bottleneck of document profiling by bypassing some limitations of existing pre-defined indexing schemes. The engine exploits a structured document representation and can activate appropriate methods to characterise and automatically index heterogeneous documents with variable layout. The originality of STRETCH lies principally in allowing unskilled users to define the indexes relevant to their document domains simply by presenting visual examples, and in applying reliable automatic information extraction methods (document classification, flexible reading strategies) to index the documents automatically, thus creating archives as desired. STRETCH offers ease of use and of application programming, and the ability to adapt dynamically to new types of documents. The system has been tested in two applications in particular, one concerning passive invoices and the other bank documents, each involving several classes of documents. The indexing strategy first classifies the document automatically, thus avoiding pre-sorting, and then locates and reads the information pertaining to the specific document class. Experimental results are encouraging overall; in particular, the document classification results fulfill the requirements of high-volume applications. Integration into production lines is under way.
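The two-stage indexing strategy described in the abstract (classify first, then read the fields that belong to the recognised class) can be illustrated with a minimal sketch. This is not the STRETCH implementation: the abstract does not give its algorithms or interfaces, so every name here (`Document`, `classify`, `field_after`, `READERS`, `index_document`) is hypothetical, a keyword rule stands in for the layout-based classifier, and plain text stands in for an imaged page.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Document:
    # Hypothetical stand-in: plain text in place of an imaged page.
    text: str


def classify(doc: Document) -> str:
    """Toy classifier (assumption): a keyword rule replaces the real
    layout-based classification, which the abstract does not detail."""
    lowered = doc.text.lower()
    if "invoice" in lowered:
        return "invoice"
    if "account" in lowered:
        return "bank_statement"
    return "unknown"


def field_after(label: str) -> Callable[[Document], str]:
    """Build a reader that returns the value on a 'label: value' line."""
    def read(doc: Document) -> str:
        for line in doc.text.splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip().lower() == label:
                return value.strip()
        return ""  # field not found on this document
    return read


# Class-specific reading strategies (hypothetical index definitions).
READERS: Dict[str, Dict[str, Callable[[Document], str]]] = {
    "invoice": {
        "number": field_after("invoice no"),
        "total": field_after("total"),
    },
    "bank_statement": {
        "account": field_after("account"),
        "balance": field_after("balance"),
    },
}


def index_document(doc: Document) -> Dict[str, str]:
    """Stage 1: classify. Stage 2: read only the fields of that class."""
    doc_class = classify(doc)
    readers = READERS.get(doc_class, {})
    fields = {name: reader(doc) for name, reader in readers.items()}
    return {"class": doc_class, **fields}


if __name__ == "__main__":
    sample = Document("INVOICE\nInvoice no: 123\nTotal: 45.00")
    print(index_document(sample))
```

Because the class is decided before any field is read, documents of mixed types can enter the same pipeline without manual pre-sorting, and supporting a new document type in this sketch only requires a new entry in `READERS`.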

Author information

Authors and Affiliations

Elsag spa TRI Department, Via G. Puccini, 2, 16154 Genova, Italy; e-mail: {enrico.appiani,annamaria.colla}@elsag.it
E. Appiani & A.M. Colla
DSI, Università di Firenze, Via S. Marta, 3, 50139 Firenze, Italy; e-mail: {cesarini,simone,giovanni}@dsi.unifi.it
F. Cesarini, S. Marinai & G. Soda
DII, Università di Siena, Via Roma, 56, 53100 Siena, Italy; e-mail: {diligmic,marco}@ultrA3.dii.unisi.it
M. Diligenti & M. Gori

Additional information

Received March 30, 2000 / Revised June 26, 2001

About this article

Cite this article

Appiani, E., Cesarini, F., Colla, A. et al. Automatic document classification and indexing in high-volume applications. IJDAR 4, 69–83 (2001). https://doi.org/10.1007/PL00010904

Issue Date: December 2001
DOI: https://doi.org/10.1007/PL00010904

Keywords: Document classification – Decision tree – MXY tree – Reading strategy
