Fast document image comparison in multilingual corpus without OCR

Lin, Yuping; Li, Yingyu; Song, Yonghong; Wang, Fang

doi:10.1007/s00530-015-0484-3

Special Issue Paper
Published: 08 October 2015

Multimedia Systems, volume 23, pages 315–324 (2017)

Yuping Lin¹, Yingyu Li¹, Yonghong Song² & Fang Wang¹

Abstract

This paper proposes a method for comparing document images in a multilingual corpus. The method consists of three stages: character segmentation, feature extraction and similarity measurement. Character segmentation follows a top-down strategy. We apply projection with a self-adaptive threshold to analyze the layout and then segment the text lines by horizontal projection. Next, English, Chinese and Japanese are identified by different methods based on the distribution and ratios of the text lines. Finally, characters are segmented with a strategy specific to each language. For feature extraction and similarity measurement, four features are used for coarse measurement, and a template is then set up. Based on the templates, a fast template matching method that uses a coarse-to-fine strategy and bit memory is presented for precise matching. Experimental results demonstrate that the method handles multilingual document images of different resolutions and font sizes with high precision and speed.
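
Two of the steps named in the abstract, text line segmentation by horizontal projection and bit-level template comparison, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold rule (a fraction `ratio` of the mean ink count over non-empty rows) and the XOR-based similarity score are all assumptions chosen for illustration, since the abstract does not specify them.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def horizontal_projection(img: Image) -> List[int]:
    """Count ink pixels in every row."""
    return [sum(row) for row in img]


def segment_lines(img: Image, ratio: float = 0.1) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges of text lines, end exclusive.

    The threshold adapts to the page: a row belongs to a text line when
    its ink count exceeds `ratio` times the mean count of non-empty rows.
    This rule is an assumption, not the threshold used in the paper.
    """
    profile = horizontal_projection(img)
    inked = [count for count in profile if count > 0]
    if not inked:
        return []
    threshold = ratio * sum(inked) / len(inked)
    lines: List[Tuple[int, int]] = []
    start = None
    for y, count in enumerate(profile):
        if count > threshold and start is None:
            start = y                      # a text line begins
        elif count <= threshold and start is not None:
            lines.append((start, y))       # the text line ends
            start = None
    if start is not None:                  # line runs to the bottom edge
        lines.append((start, len(profile)))
    return lines


def pack_bits(glyph: Image) -> int:
    """Pack a binary glyph into one integer, one bit per pixel."""
    value = 0
    for row in glyph:
        for px in row:
            value = (value << 1) | px
    return value


def bit_similarity(a: Image, b: Image) -> float:
    """Fraction of matching pixels between two same-sized glyphs.

    XOR on the packed integers marks differing pixels (hypothetical
    stand-in for the paper's bit-memory matching).
    """
    n_pixels = len(a) * len(a[0])
    differing = bin(pack_bits(a) ^ pack_bits(b)).count("1")
    return 1.0 - differing / n_pixels
```

For a six-row page whose rows contain 0, 3, 3, 0, 2 and 0 ink pixels, `segment_lines` returns `[(1, 3), (4, 5)]`. Comparing the 2×2 glyphs `[[1, 0], [0, 1]]` and `[[1, 1], [0, 1]]` with `bit_similarity` gives 0.75, because one of four pixels differs. Packing pixels into integers lets a whole glyph be compared with a single XOR, which is one plausible reading of why bit-level storage speeds up matching.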

Acknowledgments

This work was supported by Shaanxi Social Science Foundation of China under Grant Nos. 13K093 and 2015K014, National Social Science Foundation of China under Grant No. 12BYY055, Shaanxi Twelfth Year Planning Foundation of China under Grant No. SGH13015 and the Fundamental Research Funds for the Central Universities under Grant No. Sk2013008.

Author information

Authors and Affiliations

School of Foreign Studies, Xi’an Jiaotong University, Xi’an, 710049, Shaanxi, China
Yuping Lin, Yingyu Li & Fang Wang
School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, 710049, Shaanxi, China
Yonghong Song

Corresponding author

Correspondence to Yonghong Song.

About this article

Cite this article

Lin, Y., Li, Y., Song, Y. et al. Fast document image comparison in multilingual corpus without OCR. Multimedia Systems 23, 315–324 (2017). https://doi.org/10.1007/s00530-015-0484-3

Published: 08 October 2015
Issue Date: June 2017
DOI: https://doi.org/10.1007/s00530-015-0484-3
