Generation of training database using a noise model for OCR systems

Authors: Do Yen, Ha Le, In Seop Na

ICUIMC '12: Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication

Article No.: 95, Pages 1 - 6

https://doi.org/10.1145/2184751.2184862

Published: 20 February 2012

Abstract

In this paper, we present a noise model for generating synthetic character databases to train Optical Character Recognition (OCR) systems. As new typefaces continue to emerge, training character databases must be generated automatically and quickly. In addition, because the accuracy of an OCR system depends heavily on the number of training samples, many character samples must be generated to retrain it. However, collecting a large set of training samples from real images is time-consuming and laborious. We therefore develop a noise model that automatically generates realistic synthetic character images without the tedious real-world steps of printing, scanning, copying, and so on. First, our system generates digital character images. Then pepper noise, scale noise, and other kinds of noise are superimposed on the character images. Because character shapes may be distorted by real processing steps, geometric transformations are applied to the images to mimic this effect. Measuring OCR accuracy, we observed that training data obtained from real-world images and training data produced by our noise model are of comparable quality. We therefore believe that our noise model is a convenient and appropriate way to generate synthetic databases for training OCR systems.
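
The pipeline outlined in the abstract (render a clean character, distort its shape, superimpose noise) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hard-coded 5x7 glyph stands in for rendering from a font, a horizontal shear stands in for the unspecified geometric transformations, only pepper noise is modelled, and every function name and parameter range below is an assumption chosen for illustration.

```python
import random

WHITE, BLACK = 255, 0

# Hypothetical 5x7 bitmap of the letter "T"; a real pipeline would
# render the character from a font file instead.
GLYPH_T = [
    "#####",
    "..#..",
    "..#..",
    "..#..",
    "..#..",
    "..#..",
    "..#..",
]


def render_glyph(rows, scale=4, margin=4):
    """Rasterise a text glyph into a grayscale image (list of pixel rows)."""
    height = len(rows) * scale + 2 * margin
    width = len(rows[0]) * scale + 2 * margin
    image = [[WHITE] * width for _ in range(height)]
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if cell == "#":
                for dy in range(scale):
                    for dx in range(scale):
                        image[margin + r * scale + dy][margin + c * scale + dx] = BLACK
    return image


def add_pepper_noise(image, density, rng):
    """Turn each pixel black independently with probability `density`."""
    return [[BLACK if rng.random() < density else px for px in row] for row in image]


def shear_horizontal(image, factor):
    """Shear about the vertical centre using inverse nearest-neighbour mapping."""
    height, width = len(image), len(image[0])
    centre = (height - 1) / 2
    out = [[WHITE] * width for _ in range(height)]
    for y in range(height):
        shift = factor * (y - centre)
        for x in range(width):
            src_x = round(x - shift)
            if 0 <= src_x < width:
                out[y][x] = image[y][src_x]
    return out


def synthesize(rows, count, seed=0):
    """Produce `count` degraded variants of one clean glyph image."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    clean = render_glyph(rows)
    samples = []
    for _ in range(count):
        sheared = shear_horizontal(clean, rng.uniform(-0.2, 0.2))  # assumed range
        samples.append(add_pepper_noise(sheared, rng.uniform(0.0, 0.05), rng))  # assumed range
    return samples
```

Calling `synthesize(GLYPH_T, 100)` would return 100 distorted variants of the same character. Seeding the generator keeps the synthetic set reproducible, which matters when comparing OCR accuracy across training runs.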

Index Terms

Generation of training database using a noise model for OCR systems
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks

Published In

ICUIMC '12: Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication

February 2012

852 pages

ISBN:9781450311724

DOI:10.1145/2184751

Conference Chairs:
Suk-Han Lee (Sungkyunkwan University, Korea), Lajos Hanzo (University of Southampton, UK), Roslan Ismail (Universiti Kuala Lumpur, Malaysia)

Program Chairs:
Dongsoo S. Kim (Indiana University), Min Young Chung (Sungkyunkwan University, Korea), Sang-Won Lee (Sungkyunkwan University, Korea)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Ministry of Education, Science and Technology

Conference

ICUIMC '12

Sponsor:

SIGAPP
SKKU

ICUIMC '12: The 6th International Conference on Ubiquitous Information Management and Communication

February 20 - 22, 2012

Kuala Lumpur, Malaysia

Acceptance Rates

Overall Acceptance Rate 251 of 941 submissions, 27%
