Abstract
Many preprocessing techniques intended to normalize artifacts and clean noise induce anomalies in part due to the discretized nature of the document image and in part due to inherent ambiguity in the input image relative to the desired transformation. The potentially deleterious effects of common preprocessing methods are illustrated through a series of dramatic albeit contrived examples and then shown to affect real applications of ongoing interest to the community through three writer identification experiments conducted on Arabic handwriting. Retaining ruling lines detected by multi-line linear regression instead of repairing strokes broken by deleting ruling lines reduced the error rate by 4.5 %. Exploiting word position relative to detected rulings instead of ignoring it decreased errors by 5.5 %. Counteracting page skew by rotating extracted contours during feature extraction instead of rectifying the page image reduced the error by 1.4 %. All of these accuracy gains are shown to be statistically significant. Analogous methods are advocated for other document processing tasks as topics for future research.
Notes
This differs from the previous 60-writer setup because of new releases of datasets from LDC.
References
Abd-Almageed, W., Kumar, J., Doermann, D.: Page rule-line removal using linear subspaces in monochromatic handwritten Arabic documents. In: Proceedings of the 12th International Conference on Document Analysis and Recognition, pp. 768–772 (2009)
Abdou, I., Wong, K.: Analysis of linear interpolation schemes for bi-level image applications. IBM J. Res. Dev. 26(2), 667–680 (1982)
Agfa: An Introduction to Digital Scanning. Agfa-Gevaert (1994)
Arvind, K., Kumar, J., Ramakrishnan, A.: Line removal and restoration of handwritten strokes. In: Proceedings of the 7th International Conference on Computational Intelligence and Multimedia Application, pp. 208–214 (2007)
Baird, H.: Document image defect models. In: Baird, H., Bunke, H., Yamamoto, K. (eds.) Structured Document Image Analysis. Springer, Berlin (1995)
Bulacu, M., Schomaker, L.: Text-independent writer identification and verification using textural and allographic features. IEEE Trans. Pattern Anal. Mach. Intell. 29, 701–717 (2007)
Burns, P.: Slanted-edge MTF for digital camera and scanner analysis. In: Proceedings of the IS&T 2000 PICS Conference, pp. 135–138 (2000)
Cao, H., Prasad, R., Natarajan, P.: A stroke regeneration method for cleaning rule-lines in handwritten document images. In: Proceedings of the MOCR Workshop at the 10th International Conference on Document Analysis and Recognition (2009)
Chen, J.: Information preserving processing of noisy handwritten document images. Ph.D. thesis, Lehigh University, Bethlehem, PA (2015)
Chen, J., Cao, H., Prasad, R., Bhardwaj, A., Natarajan, P.: Gabor features for offline Arabic handwriting recognition. In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 53–58. Boston (2010)
Chen, J., Cheng, W., Lopresti, D.: Using perturbed handwriting to support writer identification in the presence of severe data constraints. In: Proceedings of the Document Recognition and Retrieval XVIII (IS&T/SPIE International Symposium on Electronic Imaging) (2011)
Cheriet, M., Kharma, N., Liu, C., Suen, C.: Character Recognition Systems. Wiley, Hoboken (2007)
Citing Feng Ping Shan Library, H.K.U.: China, Collection of Genealogies, 1239–2014. http://FamilySearch.org (2015)
Dietterich, T.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10, 1895–1923 (1998)
Ding, X.: Machine printed Chinese character recognition. In: Bunke, H., Wang, P. (eds.) Handbook of Character Recognition and Document Image Analysis, pp. 305–329. World Scientific, Singapore (1997)
Dodgson, N.: Image resampling. Technical Report. University of Cambridge (1992)
Doermann, D., Tombre, K.: Handbook of Document Image Processing and Recognition. Springer, Berlin (2014)
Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, Hoboken (2000)
Favata, J., Srikantan, G.: A multiple feature/resolution approach to handprinted digit and character recognition. Int. J. Image Syst. Technol. 7(4), 304–311 (1998)
Fischer, A., Riesen, K., Bunke, H.: Graph similarity features for HMM-based handwriting recognition in historical documents. In: Proceedings of the International Conference on Frontiers in Handwriting Recognition, pp. 253–258 (2010)
Gonzalez, R., Woods, R.: Digital Image Processing, 3rd edn. Pearson, New Jersey (2008)
Ha, T., Bunke, H.: Image processing methods for document image analysis. In: Bunke, H., Wang, P. (eds.) Handbook of Character Recognition and Document Image Analysis. World Scientific, Singapore (1997)
Hu, M.: Visual pattern recognition by moment invariants. IRE Trans. Inf. Theory 8(2), 179–187 (1962)
Jung, D., Krishnamoorthy, M., Nagy, G., Shapira, A.: N-tuple features for OCR revisited. IEEE Trans. Pattern Anal. Mach. Intell. 18(7), 734–745 (1996)
Khotanzad, A., Hong, Y.: Invariant image recognition by Zernike moments. IEEE Trans. Pattern Anal. Mach. Intell. 12(5), 489–497 (1990)
Kmiec, M.: New optical character recognition method based on Hu invariant moments and weighted voting. J. Appl. Comput. Sci. 19(1), 33–50 (2011)
Krishnamoorthy, M., Nagy, G., Seth, S., Viswanathan, M.: Syntactic segmentation and labeling of digitized pages from technical journals. IEEE Trans. Pattern Anal. Mach. Intell. 15(7), 737–747 (1993)
Kumar, J., Doermann, D.: Fast rule-line removal using integral images and support vector machines. In: Proceedings of the 11th International Conference on Document Analysis and Recognition, pp. 584–588 (2011)
Liu, C., Sako, H., Fujisawa, H.: Handwritten Chinese character recognition: alternatives to nonlinear normalization. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 524–528 (2003)
Lowe, D.: Object recognition from local scale-invariant features. In: Proceedings of the International Conference on Computer Vision, pp. 1150–1157 (1999)
Marinai, S.: Introduction to document analysis and recognition. In: Marinai, S., Fujisawa, H. (eds.) Machine Learning in Document Analysis and Recognition, pp. 1–20. Springer, Berlin (2008)
Mohamad, R.A.H., Likforman-Sulem, L., Mokbel, C.: Combining slanted-frame classifiers for improved HMM-based Arabic handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31(7), 1165–1177 (2009)
Nadler, M., Smith, E.: Pattern Recognition Engineering. Wiley, Hoboken (1993)
Nagy, G.: Optical scanning digitizers. IEEE Comput. 16(5), 13–24 (1983)
Nagy, G.: Preprocessing document images by resampling is error prone and unnecessary. In: Proceedings of the SPIE Conference on Document Recognition and Retrieval (2013)
Natarajan, P., Lu, Z., Bazzi, I., Schwartz, R., Makhoul, J.: Multilingual machine printed OCR. Int. J. Pattern Recognit. Artif. Intell. 15(1), 43–63 (2001)
Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9(1), 62–66 (1979)
Ouyang, T., Davis, R.: Recognition of hand drawn chemical diagrams. In: Proceedings of the Association for the Advancement of Artificial Intelligence (2007)
Pan, P., Zhu, Y., Sun, J., Naoi, S.: Recognizing characters with severe perspective distortion using hash tables and perspective invariants. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 548–552 (2011)
Parker, J., Kenyon, R., Troxel, D.: Comparison of interpolating methods for image resampling. IEEE Trans. Med. Imaging 2(1), 31–39 (1983)
Rocha, J., Pavlidis, T.: Character recognition without segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 17(9), 903–909 (1995)
Rowley-Brooke, R., Pitié, F., Kokaram, A.: A non-parametric framework for document bleed-through removal. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2954–2960 (2013)
Sarkar, P., Lopresti, D., Zhou, J., Nagy, G.: Spatial sampling of printed patterns. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 344–351 (1998)
Sivaramakrishna, R., Shashidhar, N.: Hu’s moment invariants: How invariant are they under skew and perspective transformations? In: Proceedings of the WESCANEX 97: Communications, Power and Computing, pp. 292–295 (1997)
Smith, B.: Characterization of image degradation caused by scanning. Pattern Recognit. Lett. 19(13), 1191–1197 (1998)
Sridhar, M., Houle, G., Bakker, R., Kimura, F.: Comprehensive check image reader. In: Chaudhuri, B., Parui, S. (eds.) Advances in Digital Document Processing and Retrieval, pp. 123–156. World Scientific, Singapore (2014)
The Linguistic Data Consortium. http://www.ldc.upenn.edu/ (2013)
Tatele, S., Khare, A.: Character recognition and transmission of characters using network security. Int. J. Adv. Eng. Technol. 11, 351–360 (2011)
Teague, M.: Image analysis via the general theory of moments. J. Opt. Soc. Am. 70(8), 920–930 (1980)
Uchida, S., Sakoe, H.: A survey of elastic matching techniques for handwritten character recognition. Trans. Inst. Electron. Inf. Commun. Eng. 88(D8), 1781–1790 (2005)
Wang, X., Xiao, B., Ma, J.F.: Scaling and rotation invariant analysis approach to object recognition based on Radon and Fourier–Mellin transforms. Pattern Recognit. 40(12), 3503–3508 (2007)
Watt, S., Dragan, L.: Recognition for large sets of handwritten mathematical symbols. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 740–744 (2005)
Wolf, C.: Document ink bleed-through removal with two Hidden Markov Random Fields and a single observation field. IEEE Trans. Pattern Anal. Mach. Intell. 32(3), 431–447 (2010)
Yamada, H., Yamamoto, K., Saito, T.: A nonlinear normalization method for Kanji character recognition-line density equalization. Pattern Recognit. 23(9), 1023–1029 (1990)
Yap, P., Paramesran, R., Seng-Huat, O.: Image analysis by Krawtchouk moments. IEEE Trans. Image Process. 12(11), 1367–1377 (2003)
Acknowledgments
We thank the anonymous reviewers for their valuable comments.
Appendix
For binary classification errors [18], we define:
- Type I (false positive): detecting a class that is not present.
- Type II (false negative): failing to detect a class that is present.
One often needs to compare the accuracy of one classification algorithm with that of another. According to Dietterich’s study of five statistical significance tests [14], McNemar’s test has a low probability of incorrectly detecting a difference when no difference exists, that is, a low Type I error.
Suppose there are two algorithms, baseline \(\mathcal {A}\) and proposed \(\mathcal {B}\). The available \(n\) samples are classified by both algorithms. It is observed that \(n_{10}\) of the samples are misclassified by classifier \(\mathcal {A}\) but not by \(\mathcal {B}\), \(n_{01}\) samples are misclassified only by \(\mathcal {B}\), and \(n_{11}\) samples are misclassified by both algorithms. The accuracy of \(\mathcal {A}\) is \(\mathcal {A} = (n - n_{10} - n_{11}) / n\), and the accuracy of \(\mathcal {B}\) is \(\mathcal {B} = (n - n_{01} - n_{11}) / n\). McNemar’s test, with continuity correction, is formulated as:
\[ Z^{2} = \frac{\left( |n_{01} - n_{10}| - 1 \right)^{2}}{n_{01} + n_{10}} \]
where \(n_{01}\) is the number of samples misclassified by the proposed algorithm \(\mathcal {B}\) but not by the baseline \(\mathcal {A}\), and \(n_{10}\) is the number of samples misclassified by the baseline \(\mathcal {A}\) but not by the proposed algorithm \(\mathcal {B}\). The null hypothesis is \(\mathcal {H}_{0}\): \(\mathcal {A} = \mathcal {B}\), and the alternative hypothesis is \(\mathcal {H}_{1}\): \(\mathcal {A} < \mathcal {B}\).
The test statistic \(Z^{2}\) approximately follows the Chi-square distribution with one degree of freedom. As a rule of thumb, we say that one algorithm significantly outperforms another when the null hypothesis can be rejected at a confidence level of 95 %. The critical value of \(Z^{2}\) corresponding to this 95 % confidence is 3.84. Although this statistical test is approximate, it is effective in detecting accuracy differences between algorithms [14].
As an example of the application of McNemar’s hypothesis test, consider two cases:
- Case 1: \(n_{01} = 10\), \(n_{10} = 20\), therefore \(Z^{2} = 2.70 < 3.84\). We cannot conclude that \(\mathcal {B}\) is significantly better than \(\mathcal {A}\).
- Case 2: \(n_{01} = 5\), \(n_{10} = 15\), therefore \(Z^{2} = 4.05 > 3.84\). We conclude that \(\mathcal {B}\) is more accurate than \(\mathcal {A}\) at a confidence level of 95 %.
Note that if \(n_{11} = 20\) and \(n=100\) in both cases, then in Case 1 \(\mathcal {A} = 60\,\%\) and \(\mathcal {B} = 70\,\%\). In Case 2, \(\mathcal {A}=65\,\%\) and \(\mathcal {B} = 75\,\%\). Different values of \(n_{11}\), the number of samples misclassified by both algorithms, would give different accuracies for \(\mathcal {A}\) and \(\mathcal {B}\), but that would not change our conclusion with respect to their comparative accuracy, because the test statistic depends only on the samples on which the two algorithms disagree. At low error rates and large sample sizes, small differences in accuracy can be statistically significant. It is, of course, essential to keep track of the specific errors made by each algorithm. The commonly used recall and precision measures do not provide sufficient information for this test.
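The two cases above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the continuity-corrected form of McNemar's statistic, which agrees with the values 2.70 and 4.05 quoted above, and the function names (`mcnemar_z2`, `chi2_1dof_tail`, `accuracies`) are hypothetical names chosen for illustration. The tail probability uses the identity that a Chi-square variable with one degree of freedom is the square of a standard normal variable.

```python
import math

CRITICAL_95 = 3.84  # Chi-square critical value, one degree of freedom, 95 %


def mcnemar_z2(n01: int, n10: int) -> float:
    """Continuity-corrected McNemar statistic from the discordant counts.

    n01: samples misclassified only by the proposed algorithm B.
    n10: samples misclassified only by the baseline algorithm A.
    """
    discordant = n01 + n10
    if discordant == 0:
        raise ValueError("no discordant samples; the statistic is undefined")
    return (abs(n01 - n10) - 1) ** 2 / discordant


def chi2_1dof_tail(z2: float) -> float:
    """P(X > z2) for a Chi-square variable X with one degree of freedom."""
    return math.erfc(math.sqrt(z2 / 2.0))


def accuracies(n: int, n01: int, n10: int, n11: int) -> tuple[float, float]:
    """Accuracies of A and B; n11 counts samples misclassified by both."""
    acc_a = (n - n10 - n11) / n
    acc_b = (n - n01 - n11) / n
    return acc_a, acc_b


# The two illustrative cases, with n = 100 and n11 = 20 as in the text.
for label, n01, n10 in [("Case 1", 10, 20), ("Case 2", 5, 15)]:
    z2 = mcnemar_z2(n01, n10)
    acc_a, acc_b = accuracies(100, n01, n10, 20)
    verdict = "significant" if z2 > CRITICAL_95 else "not significant"
    print(f"{label}: A={acc_a:.0%} B={acc_b:.0%} "
          f"Z^2={z2:.2f} tail={chi2_1dof_tail(z2):.3f} ({verdict})")
```

Changing the value passed for \(n_{11}\) alters the printed accuracies but leaves \(Z^{2}\) unchanged, which is the point made in the paragraph above.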
All of the differences in error rate reported as significant in Sects. 4 and 5 yielded confidence greater than 95 % with McNemar’s test.
Cite this article
Chen, J., Lopresti, D. & Nagy, G. Conservative preprocessing of document images. IJDAR 19, 321–333 (2016). https://doi.org/10.1007/s10032-016-0273-3
DOI: https://doi.org/10.1007/s10032-016-0273-3