research-article

HMM-based script identification for OCR

Authors: Dmitriy Genzel, Ashok C. Popat, Remco Teunen, Yasuhisa Fujii

MOCR '13: Proceedings of the 4th International Workshop on Multilingual OCR

Article No.: 2, Pages 1 - 5

https://doi.org/10.1145/2505377.2505382

Published: 24 August 2013

Abstract

While current OCR systems can recognize text in a growing number of scripts and languages, they typically still need to be told in advance which scripts and languages to expect. We propose an approach that repurposes the same HMM-based system used for OCR for the task of script/language ID by replacing character labels with script class labels. We apply it in a multi-pass OCR process that achieves "universal" OCR over 54 tested languages in 18 distinct scripts, across a wide variety of typefaces in each. For comparison, we also consider a brute-force approach in which a single HMM-based OCR system is trained to recognize all considered scripts. Results are presented on a large and diverse evaluation set extracted from book images, both for script identification accuracy and for overall OCR accuracy. On this evaluation data, the script ID system achieved a script ID error rate of 1.73% for 18 distinct scripts. The end-to-end OCR system with the script ID system achieved a character error rate of 4.05%, an increase of 0.77% over the case where the languages are known a priori.
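The label-replacement idea in the abstract can be illustrated with a minimal sketch. It shows only how a character-level transcription might be turned into script class labels of the kind a sequence recognizer could be trained to emit, plus a simple vote over those labels for routing a line to a script-specific recognizer. The HMM training and decoding are not shown. Everything here is an assumption for illustration: the function names are hypothetical, and deriving a script name from the first word of the Unicode character name is a rough heuristic, not the script class inventory used by the authors.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Map one character to a coarse script label (heuristic, assumed)."""
    # Digits, punctuation and whitespace are shared across scripts.
    if not ch.isalpha():
        return "COMMON"
    name = unicodedata.name(ch, "")
    # e.g. "CYRILLIC SMALL LETTER DE" -> "CYRILLIC"
    return name.split(" ")[0] if name else "UNKNOWN"


def to_script_labels(text):
    """Replace every character label with its script class label."""
    return [script_of(ch) for ch in text]


def collapse_runs(labels):
    """Drop COMMON labels and merge consecutive repeats into single runs."""
    runs = []
    for label in labels:
        if label == "COMMON":
            continue
        if not runs or runs[-1] != label:
            runs.append(label)
    return runs


def majority_script(labels):
    """Pick the most frequent non-COMMON script label, or None if absent."""
    counts = Counter(label for label in labels if label != "COMMON")
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    line = "Hello, мир!"
    labels = to_script_labels(line)
    print(collapse_runs(labels))
    print(majority_script(labels))
```

In a multi-pass setup like the one the abstract describes, a label such as the one returned by `majority_script` would select which script- or language-specific recognizer runs in the following pass. How the actual system segments and votes is not stated in the abstract.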


HMM-based script identification for OCR
1. Applied computing
  1. Document management and text processing
    1. Document capture
2. Hardware
  1. Power and energy
    1. Power estimation and optimization

Published In

MOCR '13: Proceedings of the 4th International Workshop on Multilingual OCR

August 2013

99 pages

ISBN:9781450321143

DOI:10.1145/2505377

General Chairs: Venu Govindaraju (University at Buffalo), Prem Natarajan (Information Sciences Institute), Santanu Chaudhury (IIT Delhi, India), Daniel Lopresti (Lehigh University)

Program Chairs: Srirangaraj Setlur (University at Buffalo), Huaigu Cao (Raytheon BBN Technologies)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MOCR '13: 4th International Workshop on Multilingual OCR

August 24, 2013

Washington, D.C., USA

Acceptance Rates

MOCR '13 Paper Acceptance Rate: 17 of 34 submissions, 50%

Overall Acceptance Rate: 17 of 34 submissions, 50%

Cited By

Premachandra H, Jayakody A, Kawanaka H (2022). Converting high resolution multi-lingual printed document images in to editable text using image processing and artificial intelligence. 2022 2nd International Conference on Image Processing and Robotics (ICIPRob), pages 1-7. Online publication date: 12-Mar-2022. https://doi.org/10.1109/ICIPRob54042.2022.9798739

Cheikhrouhou A, Kessentini Y, Kanoun S (2019). Hybrid HMM/BLSTM system for multi-script keyword spotting in printed and handwritten documents with identification stage. Neural Computing and Applications. Online publication date: 28-Aug-2019. https://doi.org/10.1007/s00521-019-04429-w

Fujii Y (2018). [Invited] Optical Character Recognition Research at Google. 2018 IEEE 7th Global Conference on Consumer Electronics (GCCE), pages 265-266. Online publication date: Oct-2018. https://doi.org/10.1109/GCCE.2018.8574624

Fujii Y, Driesen K, Baccash J, Hurst A, Popat A (2017). Sequence-to-Label Script Identification for Multilingual OCR. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), pages 161-168. Online publication date: Nov-2017. https://doi.org/10.1109/ICDAR.2017.35

Cheikh Rouhou A, Abdelhedi Z, Kessentini Y (2017). A HMM-Based Arabic/Latin Handwritten/Printed Identification System. Proceedings of the 16th International Conference on Hybrid Intelligent Systems (HIS 2016), pages 298-307. Online publication date: 23-Feb-2017. https://doi.org/10.1007/978-3-319-52941-7_30

Ul-Hasan A, Afzal M, Shafait F, Liwicki M, Breuel T (2015). A sequence learning approach for multiple script identification. Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1046-1050. Online publication date: 23-Aug-2015. https://dl.acm.org/doi/10.1109/ICDAR.2015.7333921

Fujii Y, Genzel D, Popat A, Teunen R (2015). Label transition and selection pruning and automatic decoding parameter optimization for time-synchronous Viterbi decoding. Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 756-760. Online publication date: 23-Aug-2015. https://dl.acm.org/doi/10.1109/ICDAR.2015.7333863
