Fast document image comparison in multilingual corpus without OCR

  • Special Issue Paper
  • Multimedia Systems

Abstract

This paper proposes a method for comparing document images in a multilingual corpus. The method consists of character segmentation, feature extraction and similarity measurement. Character segmentation follows a top-down strategy. We apply projection and a self-adaptive threshold to analyze the layout and then segment the text lines by horizontal projection. Next, English, Chinese and Japanese text is identified by different methods based on the distribution and ratios of the text lines. Finally, characters are segmented using a different strategy for each language. For feature extraction and similarity measurement, four features are used for coarse measurement, and a template is then set up. Based on the templates, a fast template matching method that uses a coarse-to-fine strategy and bit memory performs the precise matching. The experimental results demonstrate that the method can handle multilingual document images of different resolutions and font sizes with high precision and speed.
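As a rough illustration of two ideas named in the abstract, the sketch below segments text lines from a binary image by horizontal projection and compares two binary glyphs after packing each row into an integer. It is not the authors' implementation. The function names, the fixed `threshold` (the paper uses a self-adaptive one), and the reading of "bit memory" as bit-packed rows compared with XOR are all assumptions made for this example.

```python
def horizontal_projection(image):
    """Count the ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image, threshold=0):
    """Return (start, end) row ranges, end exclusive, whose projection
    exceeds `threshold`. A fixed threshold is an assumption here."""
    lines = []
    start = None
    for y, count in enumerate(horizontal_projection(image)):
        if count > threshold and start is None:
            start = y                      # a text line begins
        elif count <= threshold and start is not None:
            lines.append((start, y))       # the text line ends
            start = None
    if start is not None:                  # line runs to the bottom edge
        lines.append((start, len(image)))
    return lines


def pack_rows(glyph):
    """Pack each row of a binary glyph into one integer, one bit per pixel."""
    packed = []
    for row in glyph:
        bits = 0
        for pixel in row:
            bits = (bits << 1) | pixel
        packed.append(bits)
    return packed


def bit_distance(packed_a, packed_b):
    """Count differing pixels between two equally sized packed glyphs."""
    return sum(bin(a ^ b).count("1") for a, b in zip(packed_a, packed_b))


# Hypothetical 5x4 binary page with two text lines separated by a blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
lines = segment_lines(page)

# Two hypothetical 2x2 glyphs that differ in exactly one pixel.
distance = bit_distance(pack_rows([[1, 0], [0, 1]]),
                        pack_rows([[1, 1], [0, 1]]))
```

Packing rows into integers lets one XOR compare a whole row at once, which is one plausible way a bit-level representation could speed up template matching. The coarse-to-fine stage described in the abstract is not modeled here.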

Acknowledgments

This work was supported by Shaanxi Social Science Foundation of China under Grant Nos. 13K093 and 2015K014, National Social Science Foundation of China under Grant No. 12BYY055, Shaanxi Twelfth Year Planning Foundation of China under Grant No. SGH13015 and the Fundamental Research Funds for the Central Universities under Grant No. Sk2013008.

Author information

Corresponding author

Correspondence to Yonghong Song.

About this article

Cite this article

Lin, Y., Li, Y., Song, Y. et al. Fast document image comparison in multilingual corpus without OCR. Multimedia Systems 23, 315–324 (2017). https://doi.org/10.1007/s00530-015-0484-3

  • DOI: https://doi.org/10.1007/s00530-015-0484-3
