Multilingual corpus construction based on printed and handwritten character separation

Multimedia Tools and Applications

Abstract

This paper proposes an effective method for extracting printed and handwritten characters from multilingual document images in order to build a corpus. A connected component analysis is first used to remove graphics from the document images. Multiple types of features and the AdaBoost algorithm are then introduced to classify printed and handwritten characters in a more versatile and robust way. First, the content of each image is divided into text patches, which are used to distinguish different languages. Second, classifiers are trained on the segmented patches using the multiple feature types and AdaBoost. Finally, the trained classifiers separate the printed and handwritten parts of a new image set. The proposed method improves the precision with which written material is extracted from text images in different languages. Experimental results demonstrate that it achieves higher precision and recall than state-of-the-art methods.
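The classification step described in the abstract can be illustrated with a minimal sketch of discrete AdaBoost over decision stumps. This is not the authors' implementation: the abstract does not name the features or the weak learner, so the two per-patch features used here (stroke-width variance and baseline jitter), the toy values, and all function names are assumptions made purely for illustration. Labels are +1 for handwritten and -1 for printed.

```python
import math

def stump_predict(row, j, t, polarity):
    """Weak learner: threshold a single feature j at t."""
    return polarity if row[j] > t else -polarity

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive feature values.
        for t in [(a + b) / 2 for a, b in zip(values, values[1:])]:
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, t, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best

def adaboost_train(X, y, rounds=10):
    """Discrete AdaBoost: reweight samples so later stumps focus on mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        if stump is None:          # no usable threshold (all features constant)
            break
        err, j, t, polarity = stump
        if err >= 0.5:             # weak learner is no better than chance
            break
        err = max(err, 1e-10)      # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, t, polarity))
        # Increase weights of misclassified patches, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, t, polarity))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(row, j, t, p) for alpha, j, t, p in model)
    return 1 if score >= 0 else -1

# Hypothetical patch features: [stroke_width_variance, baseline_jitter].
X = [[0.9, 0.8], [0.8, 0.3], [0.7, 0.9], [0.3, 0.85],   # handwritten (+1)
     [0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.3, 0.25]]  # printed (-1)
y = [1, 1, 1, 1, -1, -1, -1, -1]

model = adaboost_train(X, y, rounds=3)
predictions = [adaboost_predict(model, row) for row in X]
```

No single threshold separates this toy set, which is why several reweighted stumps are combined. In practice a library implementation and real patch features would replace the hand-written loop and the invented values.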

Acknowledgments

This work was supported by the Shaanxi Social Science Foundation of China under Grant Nos. 13 K093 and 2015 K014, the National Social Science Foundation of China under Grant No. 12BYY055, the Shaanxi Twelfth Year Planning Foundation of China under Grant No. SGH13015, the Social Science Foundation of the Ministry of Education of China under Grant No. 15YJA740016, and the Fundamental Research Funds for the Central Universities under Grant No. Sk2013008.

Author information

Corresponding author

Correspondence to Yonghong Song.

About this article

Cite this article

Lin, Y., Song, Y., Li, Y. et al. Multilingual corpus construction based on printed and handwritten character separation. Multimed Tools Appl 76, 4123–4139 (2017). https://doi.org/10.1007/s11042-015-2995-5

  • DOI: https://doi.org/10.1007/s11042-015-2995-5
