
Independent component analysis for document restoration

Document Analysis and Recognition

Abstract.

We propose a novel approach to restoring digital document images, with the aim of improving text legibility and OCR performance. Both are often compromised by background artifacts arising from many kinds of degradation, such as spots, underwriting, and show-through or bleed-through effects. So far, background removal techniques have relied on local, adaptive filters and morphological-structural operators to cope with the frequent low-contrast situations. For the specific problem of bleed-through/show-through, most work has been based on comparing the front and back pages, which requires a preliminary registration of the two images. Our approach treats the problem as one of separating overlapped texts, reformulates it as a blind source separation problem, and solves it with independent component analysis techniques. These methods have the advantage of requiring no model of the background. In addition, because we use the spectral components of the image at different bands, no registration is needed. Examples of bleed-through cancellation and recovery of underwriting from palimpsests are provided.
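As a rough illustration of the idea in the abstract, the spectral channels of a single scan can be treated as linear mixtures of a few independent layers (for example foreground text and bleed-through) and unmixed with an ICA iteration. The sketch below is not the authors' implementation: the function name `separate_channels`, the noise-free linear mixing assumption, and the choice of a symmetric fixed-point iteration with a tanh nonlinearity are assumptions made here for illustration, using NumPy only.

```python
import numpy as np


def separate_channels(image, n_sources, n_iter=200, tol=1e-6, seed=0):
    """Estimate n_sources independent layers from a multi-channel image.

    image: float array of shape (height, width, channels), channels >= n_sources.
    Returns an array of shape (height, width, n_sources). Each layer is
    recovered only up to order, sign, and scale (the usual ICA ambiguities).
    """
    height, width, channels = image.shape
    x = image.reshape(-1, channels).T.astype(float)    # channels x pixels
    x -= x.mean(axis=1, keepdims=True)                 # center each channel

    # Whitening: project onto the n_sources leading principal directions
    # and rescale them to unit variance.
    cov = x @ x.T / x.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)             # ascending eigenvalues
    d = np.maximum(eigvals[-n_sources:], 1e-12)
    e = eigvecs[:, -n_sources:]
    z = (e / np.sqrt(d)).T @ x                         # n_sources x pixels

    # Symmetric fixed-point ICA iteration on the whitened data.
    rng = np.random.default_rng(seed)
    w, _ = np.linalg.qr(rng.standard_normal((n_sources, n_sources)))
    for _step in range(n_iter):
        y = w @ z
        g = np.tanh(y)
        g_prime = 1.0 - g ** 2
        w_new = g @ z.T / z.shape[1] - g_prime.mean(axis=1)[:, None] * w
        # Symmetric decorrelation: replace w_new by its nearest orthogonal matrix.
        u, _sing, vt = np.linalg.svd(w_new)
        w_new = u @ vt
        # Converged when every row matches its previous value up to sign.
        change = np.max(np.abs(np.abs(np.sum(w_new * w, axis=1)) - 1.0))
        w = w_new
        if change < tol:
            break

    return (w @ z).T.reshape(height, width, n_sources)
```

For an RGB scan loaded as a float array `rgb` of shape (height, width, 3), `separate_channels(rgb, n_sources=3)` would return three candidate layers to inspect visually. Because the channels come from the same scan, they are already aligned pixel by pixel, which is why no registration step appears in the sketch.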




Correspondence to Anna Tonazzini.

Additional information

Received: 15 April 2003, Accepted: 17 December 2003, Published online: 22 April 2004


This work has been partially supported by the European Commission project “Isyreadet” (http://www.isyreadet.net), under contract IST-1999-57462.


Tonazzini, A., Bedini, L. & Salerno, E. Independent component analysis for document restoration. IJDAR 7, 17–27 (2004). https://doi.org/10.1007/s10032-004-0121-8


  • DOI: https://doi.org/10.1007/s10032-004-0121-8
