DOI: 10.1145/2184751.2184862
research-article

Generation of training database using a noise model for OCR systems

Published: 20 February 2012

Abstract

In this paper, we present a noise model for generating synthetic character databases to train Optical Character Recognition (OCR) systems. As new typefaces continue to emerge, synthetic training character databases must be generated automatically and quickly. In addition, because the accuracy of an OCR system depends heavily on the number of training samples, a large number of character samples must be generated to retrain it. However, collecting a large set of training samples from real images is time-consuming and laborious. We therefore develop a noise model that automatically generates lifelike synthetic character images without the tedious real-world steps of printing, scanning, copying, and so on. First, our system generates digital character images. Next, pepper noise, scale noise, and other kinds of noise are superimposed on the character images. Because the shapes of characters may be distorted during real processing steps, geometric transformations are applied to the images to mimic this effect. Measuring OCR accuracy, we observed that training data generated by our noise model is comparable in quality to training data obtained from real-world images. We therefore believe that our noise model is a convenient and appropriate way to generate synthetic databases for training OCR systems.
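
The pipeline outlined in the abstract (clean glyph, superimposed noise, geometric distortion) can be illustrated with a minimal sketch. This is not the authors' implementation: the 7x5 bitmap, the function names, the horizontal shear used as the geometric transformation, and all parameter values are assumptions chosen for illustration. Scale noise is omitted because the abstract does not define it.

```python
import random

def render_glyph():
    """Return a hand-drawn 7x5 binary bitmap of the letter 'T' (1 = ink).
    A stand-in for rendering a real font, which the abstract does not detail."""
    rows = ["#####", "..#..", "..#..", "..#..", "..#..", "..#..", "..#.."]
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def add_pepper_noise(img, prob, rng):
    """Turn each background pixel into ink with probability `prob`."""
    return [[1 if (px == 0 and rng.random() < prob) else px for px in row]
            for row in img]

def shear_horizontal(img, factor):
    """Shift each row right in proportion to its distance from the bottom row.
    `factor` must be non-negative; the output is widened so no ink is lost."""
    height, width = len(img), len(img[0])
    max_shift = int(factor * (height - 1))
    out = []
    for y, row in enumerate(img):
        shift = int(factor * (height - 1 - y))
        new_row = [0] * (width + max_shift)
        for x, px in enumerate(row):
            new_row[x + shift] = px
        out.append(new_row)
    return out

def generate_samples(count, seed, pepper_prob=0.05, max_shear=0.5):
    """Produce `count` distorted, noisy variants of the clean glyph."""
    rng = random.Random(seed)  # seeded so a database can be regenerated exactly
    glyph = render_glyph()
    samples = []
    for _ in range(count):
        sheared = shear_horizontal(glyph, rng.uniform(0.0, max_shear))
        samples.append(add_pepper_noise(sheared, pepper_prob, rng))
    return samples

if __name__ == "__main__":
    for row in generate_samples(1, seed=42)[0]:
        print("".join("#" if px else "." for px in row))
```

Seeding the generator makes each synthetic database reproducible, which helps when comparing OCR accuracy across training runs. In practice, samples of differing widths would be padded or resized to a common size before training.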

    Published In

    ICUIMC '12: Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication
    February 2012
    852 pages
    ISBN:9781450311724
    DOI:10.1145/2184751

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. envelope images
    2. noise model
    3. optical character recognition
    4. training sample generation

    Conference

    ICUIMC '12

    Acceptance Rates

    Overall Acceptance Rate 251 of 941 submissions, 27%
