Abstract
This paper proposes a method for comparing document images in a multilingual corpus. The method consists of three stages: character segmentation, feature extraction and similarity measurement. Character segmentation follows a top-down strategy. Projection with a self-adaptive threshold is first used to analyze the layout, and text lines are then segmented by horizontal projection. Next, English, Chinese and Japanese text is identified by different methods based on the distribution and ratios of each text line. Finally, the characters of each language are segmented with a language-specific strategy. For feature extraction and similarity measurement, four features are used for coarse measurement, and a template is then built. Based on these templates, a fast template matching method that uses a coarse-to-fine strategy and bit memory performs the precise matching. Experimental results demonstrate that the method handles multilingual document images of different resolutions and font sizes with high precision and speed.
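
Two of the steps named in the abstract, text line segmentation by horizontal projection and bit-level comparison of character templates, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `ratio` parameter and the threshold rule (a fraction of the mean ink count over non-empty rows) are assumptions made for illustration, since the abstract does not specify the self-adaptive threshold or the bit memory layout.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def horizontal_projection(img: Image) -> List[int]:
    """Count the ink pixels in each row."""
    return [sum(row) for row in img]


def segment_lines(img: Image, ratio: float = 0.1) -> List[Tuple[int, int]]:
    """Return (start_row, end_row_exclusive) for each text line.

    Assumption: the threshold is a fraction of the mean ink count over
    non-empty rows, a stand-in for the paper's self-adaptive threshold.
    """
    profile = horizontal_projection(img)
    inked = [p for p in profile if p > 0]
    if not inked:
        return []
    threshold = ratio * sum(inked) / len(inked)
    lines: List[Tuple[int, int]] = []
    start = None
    for y, count in enumerate(profile):
        if count > threshold and start is None:
            start = y                      # entering a text line
        elif count <= threshold and start is not None:
            lines.append((start, y))       # leaving a text line
            start = None
    if start is not None:                  # line touches the bottom edge
        lines.append((start, len(profile)))
    return lines


def pack_bits(glyph: Image) -> int:
    """Pack a binary glyph into one integer, one bit per pixel."""
    bits = 0
    for row in glyph:
        for px in row:
            bits = (bits << 1) | (px & 1)
    return bits


def bit_similarity(a: Image, b: Image) -> float:
    """Fraction of matching pixels between two same-sized glyphs (XOR + popcount)."""
    h, w = len(a), len(a[0])
    if (len(b), len(b[0])) != (h, w):
        raise ValueError("glyphs must have the same size")
    diff = pack_bits(a) ^ pack_bits(b)
    return 1.0 - bin(diff).count("1") / (h * w)
```

Packing each glyph into an integer lets a single XOR compare all pixels at once, which is one plausible reading of why a bit-level representation speeds up the fine matching stage; a coarse feature check would normally run first to discard obvious mismatches before this comparison.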














Acknowledgments
This work was supported by Shaanxi Social Science Foundation of China under Grant Nos. 13K093 and 2015K014, National Social Science Foundation of China under Grant No. 12BYY055, Shaanxi Twelfth Year Planning Foundation of China under Grant No. SGH13015 and the Fundamental Research Funds for the Central Universities under Grant No. Sk2013008.
Cite this article
Lin, Y., Li, Y., Song, Y. et al. Fast document image comparison in multilingual corpus without OCR. Multimedia Systems 23, 315–324 (2017). https://doi.org/10.1007/s00530-015-0484-3
DOI: https://doi.org/10.1007/s00530-015-0484-3