Segmentation of Mixed Chinese/English Document Including Scattered Italic Characters

Xia, Yong; Wang, Chun-Heng; Dai, Ru-Wei

doi:10.1007/11940098_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4285)

Included in the following conference series:

International Conference on Computer Processing of Oriental Languages

Abstract

Segmenting mixed Chinese/English documents is difficult when many italic characters are scattered through the text. Most prior work concentrates on English documents, but mixed documents differ from English ones and have special features that must be taken into account. This paper proposes a new approach to the problem. First, an appropriate character area is chosen for italic detection. Next, a two-step strategy is adopted: italic determination is performed first, and the slant angle is estimated only if the character pattern is identified as italic. Finally, the italic character pattern is corrected by a shear transform. A method for italic determination based on a two-step weighted projection profile histogram is introduced, together with a fast algorithm for estimating the slant angle. Three large sample collections, consisting of characters, character pairs, and documents respectively, are used to evaluate the method, and encouraging results are achieved.
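
The pipeline outlined in the abstract (italic determination, slant estimation, shear correction) can be illustrated with a minimal sketch. The abstract does not specify the weighted projection profile histogram or the fast slant estimator, so the code below substitutes a generic technique: it scores a binary character image by the sum of squared column counts of its vertical projection profile and searches a small range of shear angles by brute force. All function names, the nominal test angle of 12 degrees, and the search range of -5 to 30 degrees are assumptions made for illustration only.

```python
import math

def shear(img, angle_deg):
    """Horizontally shear a binary image (list of rows of 0/1).
    A positive angle moves upper rows to the left, undoing a right-leaning slant."""
    h = len(img)
    t = math.tan(math.radians(angle_deg))
    pad = math.ceil(abs(t) * (h - 1))  # margin so shifted rows stay in bounds
    out = []
    for y, row in enumerate(img):
        shift = round(-t * (h - 1 - y))  # the bottom row is not shifted
        out.append([0] * (pad + shift) + list(row) + [0] * (pad - shift))
    return out

def profile_score(img):
    """Sum of squared column counts of the vertical projection profile.
    Upright strokes concentrate ink in few columns, which raises the score."""
    return sum(sum(col) ** 2 for col in zip(*img))

def looks_italic(img, nominal_angle=12.0):
    """Step 1 (assumed test): does shearing by a nominal italic angle sharpen the profile?"""
    return profile_score(shear(img, nominal_angle)) > profile_score(img)

def estimate_slant(img, lo=-5, hi=30):
    """Step 2 (brute force, not the paper's fast algorithm): best whole-degree angle."""
    return float(max(range(lo, hi + 1), key=lambda a: profile_score(shear(img, a))))

def deslant(img):
    """Estimate and correct the slant only when the pattern is judged italic."""
    if not looks_italic(img):
        return img, 0.0
    angle = estimate_slant(img)
    return shear(img, angle), angle

# Usage: a synthetic vertical bar, slanted by 15 degrees and then corrected.
bar = [[1 if 10 <= x < 13 else 0 for x in range(40)] for _ in range(40)]
italic = shear(bar, -15.0)  # a negative angle leans the bar to the right
fixed, angle = deslant(italic)
```

Running the italic test before the angle search mirrors the two-step strategy in the abstract: upright patterns, which make up most Chinese characters in a mixed document, skip the more expensive estimation step.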

Author information

Authors and Affiliations

Laboratory of Complex System and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing, 100080, China
Yong Xia, Chun-Heng Wang & Ru-Wei Dai

Editor information

Editors and Affiliations

Graduate School of Information Science, Nara Institute of Science and Technology, 630-0192, Takayama, Ikoma, Nara, Japan
Yuji Matsumoto
Dept of ECE, University of Illinois at Urbana Champaign, IL 61801, Urbana, USA
Richard W. Sproat
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Kam-Fai Wong
State Key Lab of Intelligent Tech. & Sys., Tsinghua University,
Min Zhang

Cite this paper

Xia, Y., Wang, CH., Dai, RW. (2006). Segmentation of Mixed Chinese/English Document Including Scattered Italic Characters. In: Matsumoto, Y., Sproat, R.W., Wong, KF., Zhang, M. (eds) Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead. ICCPOL 2006. Lecture Notes in Computer Science, vol 4285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11940098_2

DOI: https://doi.org/10.1007/11940098_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49667-0
Online ISBN: 978-3-540-49668-7
eBook Packages: Computer Science (R0)
