Automatic name extraction from degraded document images

Likforman-Sulem, Laurence; Vaillant, Pascal; de Bodard de la Jacopière, Aliette

doi:10.1007/s10044-006-0038-6

Theoretical Advances
Published: 26 August 2006

Pattern Analysis and Applications, volume 9, pages 211–227 (2006)

127 Accesses
5 Citations

Abstract

This paper addresses the automatic extraction of names from document images. Our approach combines two complementary analyses. First, an image-based analysis exploits visual clues to select regions of interest in the document. Second, a text-based analysis searches for name patterns and low-level textual features of words. The two analyses are then combined at the word level through a neural network fusion scheme. Results reported on degraded documents, such as facsimiles and photocopied technical journals, demonstrate the value of the combined approach.
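
The word-level fusion described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two textual features (capitalization, following a courtesy title), the `TITLE_PATTERN` regular expression, the single image-based score per word, the one-hidden-layer network shape, and the hand-set, untrained weights. It only shows the general idea of concatenating an image-derived score with textual features and passing them through a small network to get a per-word name score.

```python
import math
import re

# Hypothetical courtesy-title pattern; the paper's actual name patterns
# are not given in the abstract.
TITLE_PATTERN = re.compile(r"^(Mr|Mrs|Ms|Dr|Prof)\.?$")


def text_features(words, i):
    """Low-level textual features for word i (illustrative choices only)."""
    word = words[i]
    capitalized = 1.0 if word[:1].isupper() else 0.0
    follows_title = 1.0 if i > 0 and TITLE_PATTERN.match(words[i - 1]) else 0.0
    return [capitalized, follows_title]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(image_score, text_feats, hidden_w, hidden_b, out_w, out_b):
    """Forward pass of a one-hidden-layer network over the fused features."""
    # Fusion step: concatenate the image-based score with the textual features.
    x = [image_score] + text_feats
    hidden = [
        sigmoid(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(hidden_w, hidden_b)
    ]
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)


# Hand-set weights, assumed for illustration; a real system would learn them
# from labeled words.
HIDDEN_W = [[2.0, 1.0, 3.0], [1.0, 2.0, 1.0]]
HIDDEN_B = [-2.0, -1.5]
OUT_W = [3.0, 2.0]
OUT_B = -2.5


def name_scores(words, image_scores):
    """Return one score in (0, 1) per word; higher means more name-like."""
    return [
        fuse(score, text_features(words, i), HIDDEN_W, HIDDEN_B, OUT_W, OUT_B)
        for i, score in enumerate(image_scores)
    ]
```

Here `image_scores` stands in for the output of the image-based analysis, for example how strongly each word belongs to a selected region of interest. Calling `name_scores(["To", "Dr", "Smith", "regarding", "invoice"], [0.2, 0.9, 0.9, 0.1, 0.1])` yields one score per word, and with these assumed weights a capitalized word that follows a title in a high-scoring region ranks above a lowercase word in a low-scoring one.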

Acknowledgements

We thank the French Ministry of the Economy, Finance and Industry (MINEFI), which supported this work under grant no. 01.2.93.0268. This work would not have been possible without the competent help of François Yvon of the ENST Computer Science Department, who devoted much of his time during the first stage of this project to providing us with advice, guidance, and scientific experience. The authors also wish to thank Noura Azzabou for her assistance with the experiments.

Author information

Pascal Vaillant
Present address: Université des Antilles-Guyane, Institut d’Enseignement Supérieur de Guyane, Campus de Saint-Denis, Avenue d’Estrées, B.P. 792, 97337, Cayenne cedex, Guyane française

Authors and Affiliations

Ecole Nationale Supérieure des Télécommunications/TSI and CNRS-LTCI, 46 rue Barrault, 75013, Paris, France
Laurence Likforman-Sulem, Pascal Vaillant & Aliette de Bodard de la Jacopière

Corresponding author

Correspondence to Laurence Likforman-Sulem.

About this article

Cite this article

Likforman-Sulem, L., Vaillant, P. & de Bodard de la Jacopière, A. Automatic name extraction from degraded document images. Pattern Anal Applic 9, 211–227 (2006). https://doi.org/10.1007/s10044-006-0038-6

Received: 16 August 2005
Accepted: 03 June 2006
Published: 26 August 2006
Issue Date: October 2006
DOI: https://doi.org/10.1007/s10044-006-0038-6