DOI: 10.1145/2505377.2505382
research-article

HMM-based script identification for OCR

Published: 24 August 2013

Abstract

While current OCR systems are able to recognize text in an increasing number of scripts and languages, they typically still need to be told in advance what those scripts and languages are. We propose an approach that repurposes the same HMM-based system used for OCR for the task of script/language ID, by replacing character labels with script class labels. We apply it in a multi-pass overall OCR process which achieves "universal" OCR over 54 tested languages in 18 distinct scripts, over a wide variety of typefaces in each. For comparison we also consider a brute-force approach, wherein a single HMM-based OCR system is trained to recognize all considered scripts. Results are presented on a large and diverse evaluation set extracted from book images, both for script identification accuracy and for overall OCR accuracy. On this evaluation data, the script ID system achieved a script ID error rate of 1.73% for 18 distinct scripts. The end-to-end OCR system with the script ID system achieved a character error rate of 4.05%, an increase of 0.77% over the case where the languages are known a priori.
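The label-replacement idea and the multi-pass flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`script_of`, `to_script_labels`, `majority_script`, `two_pass_ocr`) are hypothetical, and deriving a script class from the first word of a character's Unicode name is an assumed stand-in for whatever label inventory the paper actually uses. The recognizers passed to `two_pass_ocr` are placeholders for trained models.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Rough script class for one character (assumption: the first word
    of its Unicode name, e.g. 'LATIN', 'CYRILLIC', 'GREEK')."""
    if not ch.isalpha():
        return None  # digits, punctuation and spaces get no script label here
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def to_script_labels(transcript):
    """Replace each character label in a training transcript with a
    script class label, dropping characters that have none."""
    labels = (script_of(ch) for ch in transcript)
    return [label for label in labels if label is not None]


def majority_script(labels):
    """Pick the most frequent script label for a line, or None if empty."""
    return Counter(labels).most_common(1)[0][0] if labels else None


def two_pass_ocr(line_image, identify_script, recognizers):
    """Pass 1 identifies the script; pass 2 runs the matching recognizer.
    Both callables are hypothetical placeholders for trained models."""
    script = identify_script(line_image)
    return script, recognizers[script](line_image)
```

For example, `to_script_labels("OCR \u0434\u0430")` yields three `LATIN` labels followed by two `CYRILLIC` labels, which is the kind of target sequence a script ID model would be trained on in place of character labels.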

Published In

MOCR '13: Proceedings of the 4th International Workshop on Multilingual OCR
August 2013
99 pages
ISBN: 9781450321143
DOI: 10.1145/2505377

Sponsors

  • BBN Technologies

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MOCR '13

Acceptance Rates

MOCR '13 Paper Acceptance Rate 17 of 34 submissions, 50%;
Overall Acceptance Rate 17 of 34 submissions, 50%

Cited By

  • (2022) Converting high resolution multi-lingual printed document images in to editable text using image processing and artificial intelligence. 2022 2nd International Conference on Image Processing and Robotics (ICIPRob), pages 1-7. DOI: 10.1109/ICIPRob54042.2022.9798739. Online publication date: 12-Mar-2022
  • (2019) Hybrid HMM/BLSTM system for multi-script keyword spotting in printed and handwritten documents with identification stage. Neural Computing and Applications. DOI: 10.1007/s00521-019-04429-w. Online publication date: 28-Aug-2019
  • (2018) [Invited] Optical Character Recognition Research at Google. 2018 IEEE 7th Global Conference on Consumer Electronics (GCCE), pages 265-266. DOI: 10.1109/GCCE.2018.8574624. Online publication date: Oct-2018
  • (2017) Sequence-to-Label Script Identification for Multilingual OCR. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), pages 161-168. DOI: 10.1109/ICDAR.2017.35. Online publication date: Nov-2017
  • (2017) A HMM-Based Arabic/Latin Handwritten/Printed Identification System. Proceedings of the 16th International Conference on Hybrid Intelligent Systems (HIS 2016), pages 298-307. DOI: 10.1007/978-3-319-52941-7_30. Online publication date: 23-Feb-2017
  • (2015) A sequence learning approach for multiple script identification. Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1046-1050. DOI: 10.1109/ICDAR.2015.7333921. Online publication date: 23-Aug-2015
  • (2015) Label transition and selection pruning and automatic decoding parameter optimization for time-synchronous Viterbi decoding. Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 756-760. DOI: 10.1109/ICDAR.2015.7333863. Online publication date: 23-Aug-2015
