Conservative preprocessing of document images

  • Original Paper
  • International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Many preprocessing techniques intended to normalize artifacts and clean noise induce anomalies, in part because of the discretized nature of the document image and in part because of inherent ambiguity in the input image relative to the desired transformation. The potentially deleterious effects of common preprocessing methods are illustrated through a series of dramatic, albeit contrived, examples. They are then shown to affect real applications of ongoing interest to the community through three writer identification experiments conducted on Arabic handwriting. Retaining ruling lines detected by multi-line linear regression, instead of deleting them and then repairing the broken strokes, reduced the error rate by 4.5 %. Exploiting word position relative to detected rulings instead of ignoring it decreased errors by 5.5 %. Counteracting page skew by rotating extracted contours during feature extraction instead of rectifying the page image reduced the error by 1.4 %. All of these accuracy gains are shown to be statistically significant. Analogous methods are advocated for other document processing tasks as topics for future research.

Notes

  1. This differs from the previous 60-writer setup because of new releases of datasets from LDC.

References

  1. Abd-Almageed, W., Kumar, J., Doermann, D.: Page rule-line removal using linear subspaces in monochromatic handwritten Arabic documents. In: Proceedings of the 12th International Conference on Document Analysis and Recognition, pp. 768–772 (2009)

  2. Abdou, I., Wong, K.: Analysis of linear interpolation schemes for bi-level image applications. IBM J. Res. Dev. 26(2), 667–680 (1982)

  3. Agfa: An Introduction to Digital Scanning. Agfa-Gevaert (1994)

  4. Arvind, K., Kumar, J., Ramakrishnan, A.: Line removal and restoration of handwritten strokes. In: Proceedings of the 7th International Conference on Computational Intelligence and Multimedia Application, pp. 208–214 (2007)

  5. Baird, H.: Document image defect models. In: Baird, H., Bunke, H., Yamamoto, K. (eds.) Structured Document Image Analysis. Springer, Berlin (1995)

  6. Bulacu, M., Schomaker, L.: Text-independent writer identification and verification using textural and allographic features. IEEE Trans. Pattern Anal. Mach. Intell. 29, 701–717 (2007)

  7. Burns, P.: Slanted-edge MTF for digital camera and scanner analysis. In: Proceedings of the IS&T 2000 PICS Conference, pp. 135–138 (2000)

  8. Cao, H., Prasad, R., Natarajan, P.: A stroke regeneration method for cleaning rule-lines in handwritten document images. In: Proceedings of the MOCR Workshop at the 10th International Conference on Document Analysis and Recognition (2009)

  9. Chen, J.: Information preserving processing of noisy handwritten document images. Ph.D. thesis, Lehigh University, Bethlehem, PA (2015)

  10. Chen, J., Cao, H., Prasad, R., Bhardwaj, A., Natarajan, P.: Gabor features for offline Arabic handwriting recognition. In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 53–58. Boston (2010)

  11. Chen, J., Cheng, W., Lopresti, D.: Using perturbed handwriting to support writer identification in the presence of severe data constraints. In: Proceedings of the Document Recognition and Retrieval XVIII (IS&T/SPIE International Symposium on Electronic Imaging) (2011)

  12. Cheriet, M., Kharma, N., Liu, C., Suen, C.: Character Recognition Systems. Wiley, Hoboken (2007)

  13. Citing Feng Ping Shan Library, H.K.U.: China, Collection of Genealogies, 1239–2014. http://FamilySearch.org (2015)

  14. Dietterich, T.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10, 1895–1923 (1998)

  15. Ding, X.: Machine printed Chinese character recognition. In: Bunke, H., Wang, P. (eds.) Handbook of Character Recognition and Document Image Analysis, pp. 305–329. World Scientific, Singapore (1997)

  16. Dodgson, N.: Image resampling. Technical Report. University of Cambridge (1992)

  17. Doermann, D., Tombre, K.: Handbook of Document Image Processing and Recognition. Springer, Berlin (2014)

  18. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, Hoboken (2000)

  19. Favata, J., Srikantan, G.: A multiple feature/resolution approach to handprinted digit and character recognition. Int. J. Image Syst. Technol. 7(4), 304–311 (1998)

  20. Fischer, A., Riesen, K., Bunke, H.: Graph similarity features for HMM-based handwriting recognition in historical documents. In: Proceedings of the International Conference on Frontiers in Handwriting Recognition, pp. 253–258 (2010)

  21. Gonzalez, R., Woods, R.: Digital Image Processing, 3rd edn. Pearson, New Jersey (2008)

  22. Ha, T., Bunke, H.: Image processing methods for document image analysis. In: Bunke, H., Wang, P. (eds.) Handbook of Character Recognition and Document Image Analysis. World Scientific, Singapore (1997)

  23. Hu, M.: Visual pattern recognition by moment invariants. IRE Trans. Inf. Theory 8(2), 179–187 (1962)

  24. Jung, D., Krishnamoorthy, M., Nagy, G., Shapira, A.: N-tuple features for OCR revisited. IEEE Trans. Pattern Anal. Mach. Intell. 18(7), 734–745 (1996)

  25. Khotanzad, A., Hong, Y.: Invariant image recognition by Zernike moments. IEEE Trans. Pattern Anal. Mach. Intell. 12(5), 489–497 (1990)

  26. Kmiec, M.: New optimal character recognition method based on Hu invariant moments and weighted voting. J. Appl. Comput. Sci. 19(1), 33–50 (2011)

  27. Krishnamoorthy, M., Nagy, G., Seth, S., Viswanathan, M.: Syntactic segmentation and labeling of digitized pages from technical journals. IEEE Trans. Pattern Anal. Mach. Intell. 15(7), 737–747 (1993)

  28. Kumar, J., Doermann, D.: Fast rule-line removal using integral images and support vector machines. In: Proceedings of the 11th International Conference on Document Analysis and Recognition, pp. 584–588 (2011)

  29. Liu, C., Sako, H., Fujisawa, H.: Handwritten Chinese character recognition: alternatives to nonlinear normalization. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 524–528 (2003)

  30. Lowe, D.: Object recognition from local scale-invariant features. In: Proceedings of the International Conference on Computer Vision, pp. 1150–1157 (1999)

  31. Marinai, S.: Introduction to document analysis and recognition. In: Marinai, S., Fujisawa, H. (eds.) Machine Learning in Document Analysis and Recognition, pp. 1–20. Springer, Berlin (2008)

  32. Mohamad, R.A.H., Likforman-Sulem, L., Mokbel, C.: Combining slanted-frame classifiers for improved HMM-based Arabic handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31(7), 1165–1177 (2009)

  33. Nadler, M., Smith, E.: Pattern Recognition Engineering. Wiley, Hoboken (1993)

  34. Nagy, G.: Optical scanning digitizers. IEEE Comput. 16(5), 13–24 (1983)

  35. Nagy, G.: Preprocessing document images by resampling is error prone and unnecessary. In: Proceedings of the SPIE Conference on Document Recognition and Retrieval (2013)

  36. Natarajan, P., Lu, Z., Bazzi, I., Schwartz, R., Makhoul, J.: Multilingual machine printed OCR. Int. J. Pattern Recognit. Artif. Intell. 15(1), 43–63 (2001)

  37. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9(1), 62–66 (1979)

  38. Ouyang, T., Davis, R.: Recognition of hand drawn chemical diagrams. In: Proceedings of the Association for the Advancement of Artificial Intelligence (2007)

  39. Pan, P., Zhu, Y., Sun, J., Naoi, S.: Recognizing characters with severe perspective distortion using hash tables and perspective invariants. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 548–552 (2011)

  40. Parker, J., Kenyon, R., Troxel, D.: Comparison of interpolating methods for image resampling. IEEE Trans. Med. Imaging 2(1), 1983 (1983)

  41. Rocha, J., Pavlidis, T.: Character recognition without segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 17(9), 903–909 (1995)

  42. Rowley-Brooke, R., Pitié, F., Kokaram, A.: A non-parametric framework for document bleed-through removal. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2954–2960 (2013)

  43. Sarkar, P., Lopresti, D., Zhou, J., Nagy, G.: Spatial sampling of printed patterns. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 344–351 (1998)

  44. Sivaramakrishna, R., Shashidhar, N.: Hu’s moment invariants: How invariant are they under skew and perspective transformations? In: Proceedings of the WESCANEX 97: Communications, Power and Computing, pp. 292–295 (1997)

  45. Smith, B.: Characterization of image degradation caused by scanning. Pattern Recognit. Lett. 19(13), 1191–1197 (1998)

  46. Sridhar, M., Houle, G., Bakker, R., Kimura, F.: Comprehensive check image reader. In: Chaudhuri, B., Parui, S. (eds.) Advances in Digital Document Processing and Retrieval, pp. 123–156. World Scientific, Singapore (2014)

  47. The Linguistic Data Consortium. http://www.ldc.upenn.edu/ (2013)

  48. Tatele, S., Khare, A.: Character recognition and transmission of characters using network security. Int. J. Adv. Eng. Technol. 11, 351–360 (2011)

  49. Teague, M.: Image analysis via the general theory of moments. J. Opt. Soc. Am. 70(8), 920–930 (1980)

  50. Uchida, S., Sakoe, H.: A survey of elastic matching techniques for handwritten character recognition. Trans. Inst. Electron. Inf. Commun. Eng. 88(D8), 1781–1790 (2005)

  51. Wang, X., Xiao, B., Ma, J.F.: Scaling and rotation invariant analysis approach to object recognition based on Radon and Fourier–Mellin transforms. Pattern Recognit. 40(12), 3503–3508 (2007)

  52. Watt, S., Dragan, L.: Recognition for large sets of handwritten mathematical symbols. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 740–744 (2005)

  53. Wolf, C.: Document ink bleed-through removal with two Hidden Markov Random Fields and a single observation field. IEEE Trans. Pattern Anal. Mach. Intell. 32(3), 431–447 (2010)

  54. Yamada, H., Yamamoto, K., Saito, T.: A nonlinear normalization method for Kanji character recognition-line density equalization. Pattern Recognit. 23(9), 1023–1029 (1990)

  55. Yap, P., Paramesran, R., Seng-Huat, O.: Image analysis by Krawtchouk moments. IEEE Trans. Image Process. 12(11), 1367–1377 (2003)

Acknowledgments

We thank the anonymous reviewers for their valuable comments.

Author information

Corresponding author

Correspondence to Jin Chen.

Appendix

For binary classification errors [18], we define:

  • Type I (false positive): detecting a class that is not present.

  • Type II (false negative): failing to detect a class that is present.

One often needs to compare the accuracy of one classification algorithm with that of another. According to Dietterich’s study of five statistical significance tests [14], McNemar’s test has a low probability of incorrectly detecting a difference when no difference exists, i.e., a low Type I error.

Suppose there are two algorithms, a baseline \(\mathcal {A}\) and a proposed \(\mathcal {B}\), and that all \(n\) available samples are classified by both. Let \(n_{10}\) be the number of samples misclassified by \(\mathcal {A}\) but not by \(\mathcal {B}\), \(n_{01}\) the number misclassified only by \(\mathcal {B}\), and \(n_{11}\) the number misclassified by both algorithms. Using the same symbols for the accuracies, the accuracy of \(\mathcal {A}\) is \(\mathcal {A} = (n - n_{10} - n_{11}) / n\), and the accuracy of \(\mathcal {B}\) is \(\mathcal {B} = (n - n_{01} - n_{11}) / n\). McNemar’s test statistic is:

$$\begin{aligned} Z^{2} = \frac{(| n_{10} - n_{01} | - 1)^{2}}{n_{10} + n_{01}}, \end{aligned}$$
(5)

where

  • \(n_{01}\) is the number of samples misclassified by the proposed algorithm \(\mathcal {B}\) but not by the baseline \(\mathcal {A}\);

  • \(n_{10}\) is the number of samples misclassified by the baseline \(\mathcal {A}\) but not by the proposed algorithm \(\mathcal {B}\);

  • the null hypothesis \(\mathcal {H}_{0}\) is \(\mathcal {A} = \mathcal {B}\);

  • the alternative hypothesis \(\mathcal {H}_{1}\) is \(\mathcal {A} < \mathcal {B}\).

The test statistic \(Z^{2}\) approximately follows the Chi-square distribution with one degree of freedom. As a rule of thumb, we say that one algorithm significantly outperforms another when the null hypothesis can be rejected at a confidence level of 95 %. The critical value of \(Z^{2}\) at this confidence level is 3.84, so the difference is declared significant when \(Z^{2} > 3.84\). Although this statistical test is approximate, it is effective in detecting accuracy differences between algorithms [14].
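The statistic of Eq. (5) and the 3.84 threshold can be sketched in a few lines of Python. This is only an illustration, not the code used in the experiments: the names `mcnemar_z2`, `b_significantly_better`, and `CHI2_CRIT_95` are hypothetical, and returning 0 when there are no discordant samples is an assumption made here to avoid a division by zero.

```python
CHI2_CRIT_95 = 3.84  # critical value quoted in the text (chi-square, 1 d.o.f., 95 %)


def mcnemar_z2(n10: int, n01: int) -> float:
    """Continuity-corrected McNemar statistic of Eq. (5).

    n10: samples misclassified by baseline A but not by proposed B.
    n01: samples misclassified by proposed B but not by baseline A.
    """
    if n10 < 0 or n01 < 0:
        raise ValueError("counts must be non-negative")
    discordant = n10 + n01
    if discordant == 0:
        # Assumption: no discordant samples means no evidence of a difference.
        return 0.0
    return (abs(n10 - n01) - 1) ** 2 / discordant


def b_significantly_better(n10: int, n01: int,
                           critical: float = CHI2_CRIT_95) -> bool:
    """True if B makes fewer exclusive errors than A and Z^2 exceeds the threshold."""
    return n10 > n01 and mcnemar_z2(n10, n01) > critical
```

Note that the statistic itself is symmetric in the two counts, so the direction of the improvement has to be checked separately, as `b_significantly_better` does by requiring `n10 > n01`.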

As an example of the application of McNemar’s hypothesis test, consider two cases:

  • Case 1: \(n_{01} = 10\), \(n_{10} = 20\), therefore \(Z^{2} = 2.70\). We cannot conclude that \(\mathcal {B}\) is significantly better than \(\mathcal {A}\).

  • Case 2: \(n_{01} = 5\), \(n_{10} = 15\), \(Z^{2} = 4.05\). Therefore, \(\mathcal {B}\) is more accurate than \(\mathcal {A}\) at a confidence level of 95 %.
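Written out with Eq. (5) and the 95 % critical value of 3.84, the arithmetic behind the two cases is:

$$\begin{aligned} \text{Case 1:}\quad Z^{2}&= \frac{(| 20 - 10 | - 1)^{2}}{20 + 10} = \frac{81}{30} = 2.70 < 3.84, \\ \text{Case 2:}\quad Z^{2}&= \frac{(| 15 - 5 | - 1)^{2}}{15 + 5} = \frac{81}{20} = 4.05 > 3.84. \end{aligned}$$

The numerator is the same in both cases because \(| n_{10} - n_{01} | = 10\) in each; only the number of discordant samples in the denominator differs.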

Note that if \(n_{11} = 20\) and \(n=100\) in both cases, then in Case 1 \(\mathcal {A} = 60\,\%\) and \(\mathcal {B} = 70\,\%\). In Case 2, \(\mathcal {A}=65\,\%\) and \(\mathcal {B} = 75\,\%\). Different values of \(n_{11}\), the number of samples misclassified by both algorithms, would give different accuracies for \(\mathcal {A}\) and \(\mathcal {B}\), but that would not change our conclusion with respect to their comparative accuracy, because \(n_{11}\) does not appear in \(Z^{2}\). At low error rates and large sample sizes, small differences in accuracy can be statistically significant. It is, of course, essential to keep track of the specific errors, since the test needs to know which samples each algorithm misclassified. The commonly used recall and precision measures do not provide sufficient information for this test.
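Because the test depends on which samples each algorithm gets wrong, the counts have to be tallied per sample. The following Python sketch shows one way to do this; the function names `tally_errors` and `accuracies` are hypothetical, and the inputs are assumed to be equal-length sequences of ground-truth labels and of the labels predicted by \(\mathcal {A}\) and \(\mathcal {B}\).

```python
from typing import Hashable, Sequence, Tuple


def tally_errors(truth: Sequence[Hashable],
                 pred_a: Sequence[Hashable],
                 pred_b: Sequence[Hashable]) -> Tuple[int, int, int]:
    """Return (n10, n01, n11) from per-sample predictions.

    n10: misclassified by A only; n01: misclassified by B only;
    n11: misclassified by both.
    """
    if not (len(truth) == len(pred_a) == len(pred_b)):
        raise ValueError("all sequences must have the same length")
    n10 = n01 = n11 = 0
    for t, a, b in zip(truth, pred_a, pred_b):
        a_wrong, b_wrong = a != t, b != t
        if a_wrong and b_wrong:
            n11 += 1
        elif a_wrong:
            n10 += 1
        elif b_wrong:
            n01 += 1
    return n10, n01, n11


def accuracies(n: int, n10: int, n01: int, n11: int) -> Tuple[float, float]:
    """Accuracies of A and B as defined in the text."""
    if n <= 0:
        raise ValueError("n must be positive")
    return (n - n10 - n11) / n, (n - n01 - n11) / n
```

For instance, `accuracies(100, 20, 10, 20)` returns `(0.6, 0.7)`, matching Case 1. Changing the last argument shifts both accuracies by the same amount and leaves \(n_{10}\) and \(n_{01}\), and hence \(Z^{2}\), untouched.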

All of the differences in error rate reported as significant in Sects. 4 and 5 yielded confidence greater than 95 % with McNemar’s test.

Cite this article

Chen, J., Lopresti, D. & Nagy, G. Conservative preprocessing of document images. IJDAR 19, 321–333 (2016). https://doi.org/10.1007/s10032-016-0273-3
