Segmentation of touching characters using an MLP

https://doi.org/10.1016/S0167-8655(98)00048-8

Abstract

In this paper, we propose a method for extracting characters from documents that contain both Hangul (Korean characters) and alphanumeric characters. To ensure accurate segmentation of touching characters, character segmentation and recognition are performed alternately. We use a recognizer to select the correct cutting point among the candidates generated by the character segmenter. The character segmenter is implemented as a Multi-Layer Perceptron (MLP) trained with the back-propagation algorithm. Because the MLP has been trained on all kinds of touching characters, it can segment all touching characters by itself. Experimental results show that the proposed method achieves a high segmentation rate on documents written in both Hangul and alphanumeric characters.

Introduction

Computer systems can provide the most efficient and convenient method for manipulating and processing information. However, the process of transferring data and text from hard-copy to computer file is time consuming and the manual keying involved is error prone. In recent years, there has been a considerable amount of research done to develop a character recognition system which will facilitate the transfer of information into computer systems without intensive manual keying.

The development of a character recognition system that recognizes well-formed and well-spaced printed texts is a relatively simple process. However, in documents with many touching characters, the recognition rate of the OCR system is considerably lower. A large proportion of the resulting recognition errors are due to segmentation errors. If touching characters are incorrectly segmented, the character recognizer cannot recognize the characters and this error affects succeeding characters as well.

To segment touching characters correctly and to develop an effective recognition system, researchers have attempted to develop new segmentation techniques (Jang et al., 1993; Doh, 1990; E.-J. Kim and T.K. Kim, 1994; Liang et al., 1994; Lee and Lee, 1995). These techniques can be divided into two categories: those which use character width and those which use recognition results.

In techniques which use character width, the widths of all Hangul characters are presumed to be equal and touching characters are cut according to the standard character width (Jang et al., 1993; Doh, 1990; E.-J. Kim and T.K. Kim, 1994). Although the widths of all Hangul characters are similar, it is nevertheless difficult to find the correct cutting point when two characters touch. When both Hangul characters and alphanumeric characters are used in a single document, segmentation becomes even more difficult.
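As a rough illustration of the width-based idea, the following Python sketch splits a touching blob into equal parts, with the number of parts estimated from a standard character width. The function name `width_based_cuts` and the rounding rule are illustrative assumptions and are not taken from the cited methods.

```python
def width_based_cuts(blob_width, standard_width):
    """Return cut columns for a blob assumed to hold equal-width characters."""
    if blob_width <= 0 or standard_width <= 0:
        raise ValueError("widths must be positive")
    # Estimate how many characters the blob contains (at least one).
    count = max(1, round(blob_width / standard_width))
    # Place cuts at equal intervals; a single character needs no cut.
    return [round(i * blob_width / count) for i in range(1, count)]


# A 62-pixel blob with a 30-pixel standard width is treated as two characters.
print(width_based_cuts(blob_width=62, standard_width=30))  # [31]
```

The sketch makes the weakness visible: the cut position depends only on the assumed width, so any character narrower or wider than the standard shifts the cut away from the true boundary.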

In techniques which use recognition results, several candidate cutting points are generated and a recognizer determines the correct cutting point. In the technique used by Liang et al. (1994), every possible combination of characters is considered and a dictionary is then used to select a meaningful word from the possible combinations. In Lee and Lee (1995), a gray-scale image is used to generate possible cutting points and a graph search algorithm is used to select the correct cutting points. Both of these methods show high segmentation accuracy; however, the need to process an extensive amount of data creates considerable overhead.

In this paper, we propose a method for segmenting touching characters in normal documents in which both Hangul and alphanumeric characters are written. In this method, several candidate cutting points are first generated by the MLP-based segmenter. Then, the character recognizer recognizes the character which is cut at each candidate cutting point and tests whether the candidate is correct or not. Fig. 1 shows a diagram of the proposed method.
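The interplay between segmenter and recognizer can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `recognize` callable, its `(label, confidence)` return value, the acceptance `threshold`, and the rule of keeping the most confident accepted candidate are all hypothetical and are not specified in the text.

```python
def segment_touching(image_width, candidates, recognize, threshold=0.5):
    """Pick the candidate cut whose left-hand piece the recognizer accepts best.

    candidates: column indices proposed by the segmenter.
    recognize:  callable(start, end) -> (label, confidence); hypothetical API.
    """
    best = None
    for cut in candidates:
        if not 0 < cut < image_width:
            continue  # a cut must leave pixels on both sides
        label, confidence = recognize(0, cut)
        if confidence >= threshold and (best is None or confidence > best[2]):
            best = (cut, label, confidence)
    return best  # None means every candidate was rejected


# Toy stand-in for a real recognizer: confident only for pieces about 30 wide.
def toy_recognizer(start, end):
    confidence = max(0.0, 1.0 - abs((end - start) - 30) / 10)
    return "?", confidence


result = segment_touching(60, [24, 31, 40], toy_recognizer)
print(result[0])  # 31
```

Only the handful of candidates proposed by the segmenter are passed to the recognizer, which is how an approach of this kind limits the recognition overhead.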

Section snippets

Touching types

Hangul has different characteristics from English or Chinese (P.K. Kim and H.J. Kim, 1994). To build a good recognition system, we need to know the characteristics of the target characters. Hangul consists of 24 graphemes. Ten of the graphemes are vowels and the rest are consonants. Characters are composed of 1 to 4 consonants and 1 to 3 vowels. The number of all possible Hangul characters is 11 172, but fewer than 3000 are actually used.
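The figure of 11 172 matches the number of syllable blocks obtained by combining 19 initial consonants, 21 medial vowels and 28 final options (27 final consonants or none), which is also how Unicode arranges its precomposed Hangul syllables. These position counts include compound letters and are not stated in the text above, so the Python sketch below should be read only as an arithmetic check.

```python
INITIALS, MEDIALS, FINALS = 19, 21, 28  # FINALS includes "no final consonant"
HANGUL_BASE = 0xAC00  # first precomposed Hangul syllable in Unicode


def compose(initial, medial, final=0):
    """Compose a syllable from zero-based position indices."""
    if not (0 <= initial < INITIALS and 0 <= medial < MEDIALS
            and 0 <= final < FINALS):
        raise ValueError("index out of range")
    return chr(HANGUL_BASE + (initial * MEDIALS + medial) * FINALS + final)


total = INITIALS * MEDIALS * FINALS
print(total)              # 11172
print(compose(18, 0, 4))  # 한
```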

There are many kinds of touching types commonly found in

Structure of the Multi-Layer Perceptron (MLP)

Fig. 3 shows the structure of the MLP. The MLP has 72 input nodes, 70 hidden nodes and 60 output nodes. It is fully connected and is trained with the back-propagation algorithm. Since each output node corresponds to one column of the input image, its output value represents the degree to which that column should be cut.
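A forward pass through a network of this shape can be sketched in pure Python as below. Only the layer sizes come from the text; the sigmoid activation, the random untrained weights, the placeholder input and the top-k rule for turning outputs into candidate columns are assumptions made for illustration.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 72, 70, 60  # layer sizes given in the text


def sigmoid(x):
    """Numerically stable logistic function (assumed activation)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def make_layer(n_in, n_out, rng):
    """Fully connected layer: one weight row, with a trailing bias, per node."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward(layer, inputs):
    # zip stops at len(inputs), so the trailing bias is added separately.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1])
            for row in layer]


def candidate_columns(outputs, k=3):
    """Return the k columns with the highest output, strongest first."""
    ranked = sorted(range(len(outputs)), key=lambda c: outputs[c], reverse=True)
    return ranked[:k]


rng = random.Random(0)
hidden_layer = make_layer(N_INPUT, N_HIDDEN, rng)   # untrained, random weights
output_layer = make_layer(N_HIDDEN, N_OUTPUT, rng)

features = [rng.random() for _ in range(N_INPUT)]   # placeholder input vector
outputs = forward(output_layer, forward(hidden_layer, features))
print(len(outputs), candidate_columns(outputs))
```

With trained weights, the columns returned by `candidate_columns` would play the role of the candidate cutting points that are handed to the recognizer.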

The input to the MLP is a 72-dimensional mesh vector extracted from a 30 × 60 normalized character image. The 72 integer values are obtained by counting the number of pixels in each
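One way to obtain 72 counts from a 30 × 60 image is sketched below in Python. The cell size is an assumption: 5 × 5 cells are used here only because (30 / 5) × (60 / 5) = 72, and the snippet above does not state the actual mesh. The sketch also assumes a binary image that is 30 rows high and 60 columns wide, with foreground pixels set to 1.

```python
HEIGHT, WIDTH = 30, 60  # normalized image size given in the text
CELL = 5                # assumed cell size: (30 / 5) * (60 / 5) = 72 cells


def mesh_vector(image):
    """Count foreground pixels (value 1) in each CELL x CELL block, row-major."""
    if len(image) != HEIGHT or any(len(row) != WIDTH for row in image):
        raise ValueError("expected a 30 x 60 binary image")
    features = []
    for top in range(0, HEIGHT, CELL):
        for left in range(0, WIDTH, CELL):
            features.append(sum(image[r][c]
                                for r in range(top, top + CELL)
                                for c in range(left, left + CELL)))
    return features


# Example: a blank image with one filled 5 x 5 block in the top-left corner.
image = [[0] * WIDTH for _ in range(HEIGHT)]
for r in range(5):
    for c in range(5):
        image[r][c] = 1

vector = mesh_vector(image)
print(len(vector), vector[0], sum(vector))  # 72 25 25
```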

Experimental results and analysis

The system is implemented in the C language on an IBM PC. 150 pages of a document were scanned using an HP flatbed scanner at 300 dpi. 100 pages were used to train the character segmentation MLP, and the remaining 50 pages were used for testing. The documents used in this experiment contain about 20% touching characters. The segmentation accuracy is 92.2% for touching characters only and 99.2% for all characters in the document. Table 2 shows the proportion of each touching type in the document and the

Conclusion

This paper has proposed a method for extracting characters from a document written in both Hangul and alphanumeric characters. In this method, the touching character segmenter is implemented by a Multi-Layer Perceptron trained by a back-propagation algorithm. The MLP-based character segmenter generates several candidate cutting points and the character recognizer recognizes these segmented images in order to select the correct cutting point among candidates. Through experiments with various

