Conservative preprocessing of document images

Chen, Jin; Lopresti, Daniel; Nagy, George

doi:10.1007/s10032-016-0273-3

Abstract

Many preprocessing techniques intended to normalize artifacts and clean noise induce anomalies in part due to the discretized nature of the document image and in part due to inherent ambiguity in the input image relative to the desired transformation. The potentially deleterious effects of common preprocessing methods are illustrated through a series of dramatic albeit contrived examples and then shown to affect real applications of ongoing interest to the community through three writer identification experiments conducted on Arabic handwriting. Retaining ruling lines detected by multi-line linear regression instead of repairing strokes broken by deleting ruling lines reduced the error rate by 4.5 %. Exploiting word position relative to detected rulings instead of ignoring it decreased errors by 5.5 %. Counteracting page skew by rotating extracted contours during feature extraction instead of rectifying the page image reduced the error by 1.4 %. All of these accuracy gains are shown to be statistically significant. Analogous methods are advocated for other document processing tasks as topics for future research.

Notes

This differs from the previous 60-writer setup because of new releases of datasets from LDC.

References

Abd-Almageed, W., Kumar, J., Doermann, D.: Page rule-line removal using linear subspaces in monochromatic handwritten Arabic documents. In: Proceedings of the 12th International Conference on Document Analysis and Recognition, pp. 768–772 (2009)
Abdou, I., Wong, K.: Analysis of linear interpolation schemes for bi-level image applications. IBM J. Res. Dev. 26(2), 667–680 (1982)
Agfa: An Introduction to Digital Scanning. Agfa-Gevaert (1994)
Arvind, K., Kumar, J., Ramakrishnan, A.: Line removal and restoration of handwritten strokes. In: Proceedings of the 7th International Conference on Computational Intelligence and Multimedia Application, pp. 208–214 (2007)
Baird, H.: Document image defect models. In: Baird, H., Bunke, H., Yamamoto, K. (eds.) Structured Document Image Analysis. Springer, Berlin (1995)
Bulacu, M., Schomaker, L.: Text-independent writer identification and verification using textural and allographic features. IEEE Trans. Pattern Anal. Mach. Intell. 29, 701–717 (2007)
Burns, P.: Slanted-edge MTF for digital camera and scanner analysis. In: Proceedings of the IS&T 2000 PICS Conference, pp. 135–138 (2000)
Cao, H., Prasad, R., Natarajan, P.: A stroke regeneration method for cleaning rule-lines in handwritten document images. In: Proceedings of the MOCR Workshop at the 10th International Conference on Document Analysis and Recognition (2009)
Chen, J.: Information preserving processing of noisy handwritten document images. Ph.D. thesis, Lehigh University, Bethlehem, PA (2015)
Chen, J., Cao, H., Prasad, R., Bhadwaj, A., Natarajan, P.: Gabor features for offline Arabic handwriting recognition. In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 53–58. Boston (2010)
Chen, J., Cheng, W., Lopresti, D.: Using perturbed handwriting to support writer identification in the presence of severe data constraints. In: Proceedings of the Document Recognition and Retrieval XVIII (IS&T/SPIE International Symposium on Electronic Imaging) (2011)
Cheriet, M., Kharma, N., Liu, C., Suen, C.: Character Recognition Systems. Wiley, Hoboken (2007)
Citing Feng Ping Shan Library, H.K.U.: China, Collection of Genealogies, 1239–2014. http://FamilySearch.org (2015)
Dietterich, T.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10, 1895–1923 (1998)
Ding, X.: Machine printed Chinese character recognition. In: Bunke, H., Wang, P. (eds.) Handbook of Character Recognition and Document Image Analysis, pp. 305–329. World Scientific, Singapore (1997)
Dodgson, N.: Image resampling. Technical Report. University of Cambridge (1992)
Doermann, D., Tombre, K.: Handbook of Document Image Processing and Recognition. Springer, Berlin (2014)
Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, Hoboken (2000)
Favata, J., Srikantan, G.: A multiple feature/resolution approach to handprinted digit and character recognition. Int. J. Image Syst. Technol. 7(4), 304–311 (1998)
Fischer, A., Riesen, K., Bunke, H.: Graph similarity features for HMM-based handwriting recognition in historical documents. In: Proceedings of the International Conference on Frontiers in Handwriting Recognition, pp. 253–258 (2010)
Gonzalez, R., Woods, R.: Digital Image Processing, 3rd edn. Pearson, New Jersey (2008)
Ha, T., Bunke, H.: Image processing methods for document image analysis. In: Bunke, H., Wang, P. (eds.) Handbook of Character Recognition and Document Image Analysis. World Scientific, Singapore (1997)
Hu, M.: Visual pattern recognition by moment invariants. IRE Trans. Inf. Theory 8(2), 179–187 (1962)
Jung, D., Krishnamoorthy, M., Nagy, G., Shapira, A.: N-tuple features for OCR revisited. IEEE Trans. Pattern Anal. Mach. Intell. 18(7), 734–745 (1996)
Khotanzad, A., Hong, Y.: Invariant image recognition by Zernike moments. IEEE Trans. Pattern Anal. Mach. Intell. 12(5), 489–497 (1990)
Kmiec, M.: New optimal character recognition method based on Hu invariant moments and weighted voting. J. Appl. Comput. Sci. 19(1), 33–50 (2011)
Krishnamoorthy, M., Nagy, G., Seth, S., Viswanathan, M.: Syntactic segmentation and labeling of digitized pages from technical journals. IEEE Trans. Pattern Anal. Mach. Intell. 15(7), 737–747 (1993)
Kumar, J., Doermann, D.: Fast rule-line removal using integral images and support vector machines. In: Proceedings of the 11th International Conference on Document Analysis and Recognition, pp. 584–588 (2011)
Liu, C., Sako, H., Fujisawa, H.: Handwritten Chinese character recognition: alternatives to nonlinear normalization. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 524–528 (2003)
Lowe, D.: Object recognition from local scale-invariant features. In: Proceedings of the International Conference on Computer Vision, pp. 1150–1157 (1999)
Marinai, S.: Introduction to document analysis and recognition. In: Marinai, S., Fujisawa, H. (eds.) Machine Learning in Document Analysis and Recognition, pp. 1–20. Springer, Berlin (2008)
Mohamad, R.A.H., Likforman-Sulem, L., Mokbel, C.: Combining slanted-frame classifiers for improved HMM-based Arabic handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31(7), 1165–1177 (2009)
Nadler, M., Smith, E.: Pattern Recognition Engineering. Wiley, Hoboken (1993)
Nagy, G.: Optical scanning digitizers. IEEE Comput. 16(5), 13–24 (1983)
Nagy, G.: Preprocessing document images by resampling is error prone and unnecessary. In: Proceedings of the SPIE Conference on Document Recognition and Retrieval (2013)
Natarajan, P., Lu, Z., Bazzi, I., Schwartz, R., Makhoul, J.: Multilingual machine printed OCR. Int. J. Pattern Recognit. Artif. Intell. 15(1), 43–63 (2001)
Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9(1), 62–66 (1979)
Ouyang, T., Davis, R.: Recognition of hand drawn chemical diagrams. In: Proceedings of the Association for the Advancement of Artificial Intelligence (2007)
Pan, P., Zhu, Y., Sun, J., Naoi, S.: Recognizing characters with severe perspective distortion using hash tables and perspective invariants. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 548–552 (2011)
Parker, J., Kenyon, R., Troxel, D.: Comparison of interpolating methods for image resampling. IEEE Trans. Med. Imaging 2(1), 1983 (1983)
Rocha, J., Pavlidis, T.: Character recognition without segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 17(9), 903–909 (1995)
Rowley-Brooke, R., Pitié, F., Kokaram, A.: A non-parametric framework for document bleed-through removal. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2954–2960 (2013)
Sarkar, P., Lopresti, D., Zhou, J., Nagy, G.: Spatial sampling of printed patterns. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 344–351 (1998)
Sivaramakrishna, R., Shashidharf, N.: Hu’s moment invariants: How invariant are they under skew and perspective transformations? In: Proceedings of the WESCANEX 97: Communications, Power and Computing, pp. 292–295 (1997)
Smith, B.: Characterization of image degradation caused by scanning. Pattern Recognit. Lett. 19(13), 1191–1197 (1998)
Sridhar, M., Houle, G., Bakker, R., Kimura, F.: Comprehensive check image reader. In: Chaudhuri, B., Parui, S. (eds.) Advances in Digital Document Processing and Retrieval, pp. 123–156. World Scientific, Singapore (2014)
The Linguistic Data Consortium. http://www.ldc.upenn.edu/ (2013)
Tatele, S., Khare, A.: Character recognition and transmission of characters using network security. Int. J. Adv. Eng. Technol. 11, 351–360 (2011)
Teague, M.: Image analysis via the general theory of moments. J. Opt. Soc. Am. 70(8), 920–930 (1980)
Uchida, S., Sakoe, H.: A survey of elastic matching techniques for handwritten character recognition. Trans. Inst. Electron. Inf. Commun. Eng. 88(D8), 1781–1790 (2005)
Wang, X., Xiao, B., Ma, J.F.: Scaling and rotation invariant analysis approach to object recognition based on Radon and Fourier–Mellin transforms. Pattern Recognit. 40(12), 3503–3508 (2007)
Watt, S., Dragan, L.: Recognition for large sets of handwritten mathematical symbols. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 740–744 (2005)
Wolf, C.: Document ink bleed-through removal with two Hidden Markov Random Fields and a single observation field. IEEE Trans. Pattern Anal. Mach. Intell. 32(3), 431–447 (2010)
Yamada, H., Yamamoto, K., Saito, T.: A nonlinear normalization method for Kanji character recognition-line density equalization. Pattern Recognit. 23(9), 1023–1029 (1990)
Yap, P., Paramesran, R., Seng-Huat, O.: Image analysis by Krawtchouk moments. IEEE Trans. Image Process. 12(11), 1367–1377 (2003)

Acknowledgments

We thank the anonymous reviewers for their valuable comments.

Author information

Authors and Affiliations

Nuance Communications, 675 Massachusetts Avenue, Cambridge, MA, 02139, USA
Jin Chen
CSE Department, Lehigh University, 19 Memorial Drive West, Bethlehem, PA, 18015, USA
Daniel Lopresti
ECSE Department, Rensselaer Polytechnic Institute, 6020 Johnsson Engineering Center, Troy, NY, 12180, USA
George Nagy

Corresponding author

Correspondence to Jin Chen.

Appendix

For binary classification errors [18], we define:

Type I (false positive): detecting a class that is not present.
Type II (false negative): failing to detect a class that is present.
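
The two error types can be counted directly from paired ground-truth and predicted labels. The following is a minimal sketch, assuming labels are encoded as 0 (class absent) and 1 (class present); the function name `count_binary_errors` and the sample labels are illustrative and do not come from the article.

```python
def count_binary_errors(y_true, y_pred):
    """Return (type_i, type_ii) error counts for binary labels.

    Assumes 1 means the class is present and 0 means it is absent.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Type I (false positive): class predicted although it is not present.
    type_i = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    # Type II (false negative): class present but not predicted.
    type_ii = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return type_i, type_ii


# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1]
type_i, type_ii = count_binary_errors(y_true, y_pred)
print(type_i, type_ii)
```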

One often needs to compare the accuracy of one classification algorithm with that of another. According to Dietterich’s study of five statistical significance tests [14], McNemar’s test has a low probability of incorrectly detecting a difference when no difference exists, that is, a low Type I error.

Suppose there are two algorithms, baseline $\mathcal {A}$ and proposed $\mathcal {B}$. The available $n$ samples are classified by both algorithms. It is observed that $n_{10}$ of the samples are misclassified by classifier $\mathcal {A}$ but not by $\mathcal {B}$, $n_{01}$ samples are misclassified only by $\mathcal {B}$, and $n_{11}$ samples are misclassified by both algorithms. Using the same symbol for an algorithm and its accuracy, the accuracy of $\mathcal {A}$ is $\mathcal {A} = (n - n_{10} - n_{11}) / n$, and the accuracy of $\mathcal {B}$ is $\mathcal {B} = (n - n_{01} - n_{11}) / n$. McNemar’s test is formulated as:

$$Z^{2} = \frac{(| n_{10} - n_{01} | - 1)^{2}}{n_{10} + n_{01}}, \tag{5}$$

where

$n_{01}$: number of samples misclassified by the proposed algorithm $\mathcal {B}$, but not by the baseline $\mathcal {A}$.
$n_{10}$: number of samples misclassified by the baseline $\mathcal {A}$, but not by the proposed algorithm $\mathcal {B}$.
Null hypothesis $\mathcal {H}_{0}$: $\mathcal {A} = \mathcal {B}$.
Alternative hypothesis $\mathcal {H}_{1}$: $\mathcal {A} < \mathcal {B}$.

The test statistic $Z^{2}$ approximately follows the Chi-square distribution with one degree of freedom. As a rule of thumb, we say that one algorithm significantly outperforms another when the difference holds at a confidence level of 95 %. The critical value of $Z^{2}$ at this confidence level is 3.84, so the null hypothesis is rejected when $Z^{2} > 3.84$. Although this statistical test is approximate, it is effective in detecting accuracy differences between algorithms [14].

As an example of the application of McNemar’s hypothesis test, consider two cases:

Case 1: $n_{01} = 10$, $n_{10} = 20$, therefore $Z^{2} = 2.70$. We cannot conclude that $\mathcal {B}$ is significantly better than $\mathcal {A}$.
Case 2: $n_{01} = 5$, $n_{10} = 15$, $Z^{2} = 4.05$. Therefore, $\mathcal {B}$ is more accurate than $\mathcal {A}$ at a confidence level of 95 %.
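
The two cases can be reproduced with a few lines of Python. This is a minimal sketch of the statistic in Eq. (5), not the authors' code; the names `mcnemar_statistic`, `is_significant`, and `CHI2_CRITICAL_95` are illustrative, and the threshold 3.84 is the value quoted in the text.

```python
# Critical value quoted in the text for 95 % confidence, one degree of freedom.
CHI2_CRITICAL_95 = 3.84


def mcnemar_statistic(n10, n01):
    """McNemar's Z^2 with continuity correction, as in Eq. (5).

    n10: samples misclassified by baseline A but not by proposed B.
    n01: samples misclassified by proposed B but not by baseline A.
    """
    disagreements = n10 + n01
    if disagreements == 0:
        raise ValueError("the statistic is undefined when n10 + n01 == 0")
    return (abs(n10 - n01) - 1) ** 2 / disagreements


def is_significant(n10, n01, critical=CHI2_CRITICAL_95):
    """True when Z^2 exceeds the critical value.

    The statistic is symmetric in n10 and n01, so the caller still has to
    check that n10 > n01 before concluding that B is the better algorithm.
    """
    return mcnemar_statistic(n10, n01) > critical


for label, n01, n10 in [("Case 1", 10, 20), ("Case 2", 5, 15)]:
    z2 = mcnemar_statistic(n10, n01)
    print(f"{label}: Z^2 = {z2:.2f}, significant = {is_significant(n10, n01)}")
```

Both cases have the same difference $n_{10} - n_{01} = 10$; only the smaller number of disagreements in Case 2 pushes $Z^{2}$ above the threshold.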

Note that if $n_{11} = 20$ and $n=100$ in both cases, then in Case 1 $\mathcal {A} = 60\,\%$ and $\mathcal {B} = 70\,\%$. In Case 2, $\mathcal {A}=65\,\%$ and $\mathcal {B} = 75\,\%$. Different values of $n_{11}$, the number of samples misclassified by both algorithms, would give different accuracies for $\mathcal {A}$ and $\mathcal {B}$, but that would not change our conclusion with respect to their comparative accuracy, because $n_{11}$ does not appear in $Z^{2}$. At low error rates and large sample sizes, small differences in accuracy can be statistically significant. It is, of course, essential to keep track of the specific errors made on each sample. The commonly used recall and precision measures do not provide sufficient information for this test.
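
Because the test needs per-sample outcomes, the counts $n_{10}$, $n_{01}$, and $n_{11}$ have to be derived from paired correctness records rather than from aggregate scores. The sketch below assumes each classifier's results are available as a list of booleans (True for a correctly classified sample); the function names and the small sample lists are illustrative only.

```python
def disagreement_counts(correct_a, correct_b):
    """Return (n10, n01, n11) from paired per-sample correctness flags."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both classifiers must be scored on the same samples")
    pairs = list(zip(correct_a, correct_b))
    n10 = sum(1 for a, b in pairs if not a and b)      # only A is wrong
    n01 = sum(1 for a, b in pairs if a and not b)      # only B is wrong
    n11 = sum(1 for a, b in pairs if not a and not b)  # both are wrong
    return n10, n01, n11


def accuracies(n, n10, n01, n11):
    """Accuracies of A and B as defined in the appendix."""
    return (n - n10 - n11) / n, (n - n01 - n11) / n


# Hypothetical four-sample record, for illustration only.
print(disagreement_counts([True, False, False, True],
                          [True, True, False, False]))

# Cases 1 and 2 from the text, with n = 100 and n11 = 20.
for label, n01, n10 in [("Case 1", 10, 20), ("Case 2", 5, 15)]:
    acc_a, acc_b = accuracies(100, n10, n01, 20)
    print(f"{label}: A = {acc_a:.0%}, B = {acc_b:.0%}")
```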

All of the differences in error rate reported as significant in Sects. 4 and 5 yielded confidence greater than 95 % with McNemar’s test.

About this article

Cite this article

Chen, J., Lopresti, D. & Nagy, G. Conservative preprocessing of document images. IJDAR 19, 321–333 (2016). https://doi.org/10.1007/s10032-016-0273-3

Received: 14 October 2015
Revised: 16 August 2016
Accepted: 27 August 2016
Published: 20 September 2016
Issue Date: December 2016
DOI: https://doi.org/10.1007/s10032-016-0273-3
