
1 Introduction

Text recognition is a popular topic in the deep learning community. Most existing deep learning-based works [13, 15, 20, 22, 23, 24] focus on Latin script and achieve good performance.

However, Chinese text differs greatly from Latin text. There are thousands of common Chinese characters, appearing in various styles. Chinese text recognition can be regarded as a kind of fine-grained image classification due to the high inter-class similarity and the large intra-class variance, as shown in Fig. 1(a). Besides, there is usually severe data imbalance over character classes [32, 33]. These characteristics create a large demand for training data, so character-based recognition models are prone to overfitting. Radical-based methods [16, 29, 30, 35] decompose Chinese characters into radicals as the basic classes, which simplifies the character structure and decreases the number of classes, thus reducing the demand for training data. We mention them here to underline how much data character-based recognition requires. Nevertheless, radical-based methods are not flexible enough for various circumstances, e.g., handwritten text [37], and current methods of this kind are time-consuming in practice due to their RNN-based decoders.

Fig. 1.

Examples of Chinese characters and two kinds of model prediction. (a) The two images in each column belong to the same character class, while the three images in each row belong to different classes, illustrating the high inter-class similarity and the large intra-class variance that characterize a fine-grained problem. (b) Although the two probability distributions yield the same prediction on the training set, the left one has a higher entropy and describes the learned features better. One negative character is clearly far more similar to the ground truth than the other, so the confidences assigned to them should differ substantially.

A common practice to deal with overfitting is to regularize the model during training. Several techniques aim at this, including dropout [25], L2 regularization and batch normalization [10]. They act on model parameters or hidden units like a black box. We instead consider regularization from the perspective of entropy. The predicted probability distribution is an indication of how the network generalizes [19]. In Chinese character recognition, we hope that, given an input image, a similar negative class is assigned a larger probability than a dissimilar one. This requires the probability of the positive class not to be too large, leaving probability mass for the negative classes, which results in a higher entropy. This is acceptable because the prediction is still correct as long as the probability of the positive class is the largest. The maximum entropy principle [12] also states that the model whose output distribution has the largest entropy represents the features best. Figure 1(b) illustrates the idea by comparing two probability distributions with high and low entropy respectively. Hence we adopt Maximum Entropy Regularization [19] to regularize the training process.

In this paper, we perform an in-depth theoretical analysis of Maximum Entropy Regularization. The cross-entropy loss and the negative entropy term behave like two one-dimensional forces acting on the output probability distribution. The elegant gradient function illustrates the regularized behaviour from the perspective of backward propagation. Under an assumption similar to that of label smoothing, we formulate the convergence probability distribution on the training set, which is exactly a relationship between the convergence probability of the positive class and the coefficient \(\lambda \).

We conduct experiments on Chinese character recognition, Chinese text line recognition and fine-grained image classification, and obtain consistent improvement. In addition, we find that a model trained with MER attends to more compact and discriminative regions and filters out much of the noisy area. MER also makes the model more robust when label corruption is applied to the training data.

2 Related Works

Our work focuses on model regularization and its application to Chinese text recognition. Here we briefly review some recent works about these two aspects.

2.1 Model Regularization

Large deep neural networks are prone to overfitting in many cases. To relieve the problem, model regularization is commonly used during training. Dropout [25] randomly drops neurons or connections with a certain probability in layers. L2 regularization, also called weight decay, restricts the magnitude of model weights. Batch normalization [10] normalizes hidden units within a training batch to reduce internal covariate shift. These methods act on model parameters or hidden layers, which is hard to control and not intuitive. Mixup [34] applies a linear operation to both input images and their labels under the assumption that linear interpolations of features should lead to linear interpolations of the associated targets.

Recently, the output distribution of neural networks has received much attention. Knowledge distillation [8] trains a small model to have an output distribution similar to that of a large model, since the “soft targets” transfer the generalization ability, which indicates the effect of the output distribution. Label smoothing [27] encourages the prediction to be less confident by perturbing the one-hot ground-truth label with a uniform distribution; in terms of the loss function, this adds a KL-divergence term (between the uniform and the output distribution) to the cross-entropy loss. Label smoothing improves generalization and model calibration, and benefits distillation [17]. Softermax loss [1] encourages the sum of the top-k output probabilities to be as large as possible to alleviate the extreme confidence caused by the cross-entropy loss, but the hyperparameter k is hard to choose. Bootstrapping [21] combines the ground-truth distribution and the output distribution into the target distribution, so that the model can be trained well on datasets with noisy labels [3]; however, the output entropy remains as low as that obtained with the cross-entropy loss. According to the maximum entropy principle [12], the model whose probability distribution best represents the current state of knowledge is the one with the largest entropy. Correspondingly, a maximum-entropy based method called confidence penalty [19] adds a negative entropy term to the loss function, which acts similarly to label smoothing but performs better. As a result, it has attracted researchers' interest in applying it to multiple tasks, e.g., sequence modeling [14], named entity recognition [31] and fine-grained image classification [2]. CTC [5]-based sequence models penalize peaky distributions by using maximum-entropy based regularization [14]. However, none of these works analyzes the regularization term or the whole loss function in depth, either theoretically or experimentally.

2.2 Chinese Text Recognition

Chinese text recognition is more challenging than Latin script recognition due to the larger number of character categories and the more complicated layouts. Two tasks are usually derived from it: Chinese character recognition (CCR) and Chinese text line recognition (CLR).

As for Chinese characters, recent methods can be divided into two streams: character-based CCR (CCCR) [39] and radical-based CCR (RCCR) [35]. When each Chinese character is taken as a single class, it can be classified well with deep learning [36]. However, CCCR cannot handle unseen characters. By considering the structure of Chinese characters, RCCR methods exploit radicals to represent a character [16, 29, 30, 35]. Multi-label learning has been used to detect radicals [29]. Radical analysis network (RAN) [35] treats the spatial structure of a single Chinese character as a radical sequence and decodes it with an attention-based RNN. JSRAN [30] improves RAN by jointly using STN [11] and RAN. However, RAN is very time-consuming during inference, and RCCR struggles with some nonstandard handwritten text. In this paper, we choose CCCR and use SE-ResNet-50 [9] as the backbone for CCR.

As for Chinese text lines, there are currently also two streams: one is the Convolutional Recurrent Neural Network (CRNN) [23] with CTC [5], and the other is the attention-based Encoder-Decoder [24]. Their vanilla versions can only process images one-dimensionally. 2D-attention [13, 15] decodes an encoded text image from a two-dimensional perspective, which is an extension of the latter. In our experiments, we simply use a 1-D attention-based Encoder-Decoder for CLR.

3 Analysis on Maximum Entropy Regularization

Unlike previous works, which only analyze the entropy term, we study both the term and the complete loss function to discover their joint effect.

3.1 Review of Cross-Entropy Loss

We first review the common setting of a classification problem. Given an input sample x with label y, a classification model produces \(\mathrm {C}\) scores \(\left\{ z_i \right\} _{i=1}^\mathrm {C}\). Then we obtain the output probability distribution p via the softmax function:

$$\begin{aligned} p_i=\frac{e^{z_i}}{\sum \limits _j e^{z_j}} \end{aligned}$$
(1)

The derivative of softmax is:

$$\begin{aligned} \frac{\partial p_i}{\partial z_j} = {\left\{ \begin{array}{ll} p_j \left( 1-p_j \right) , &{} i=j\\ -p_i p_j, &{} i\ne j \end{array}\right. } \end{aligned}$$
(2)

The cross-entropy (CE) loss and its derivative are:

$$\begin{aligned} L_{\mathrm {CE}}=-\log p_y \end{aligned}$$
(3)
$$\begin{aligned} \frac{\partial L_{\mathrm {CE}}}{\partial z_i} = {\left\{ \begin{array}{ll} p_i-1<0, &{} i=y\\ p_i>0, &{} i\ne y \end{array}\right. } \end{aligned}$$
(4)

Optimized by gradient descent, the model pushes the probability of the y-th class higher and higher and those of the other classes lower and lower, which leads to an approximation of the one-hot vector on the training set. Consequently we get confident output with low entropy, which is often a symptom of overfitting [27]. In image classification, a confident model tends to focus on many regions, even including noisy background, in order to gather sufficient clues for an assertive prediction.
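As a quick sanity check of Eqs. 1–4, the following PyTorch snippet (a minimal sketch; the class count, seed and tolerance are arbitrary choices of ours) compares the autograd gradient of the cross-entropy loss with the closed form \(p_i - \mathbb {1}[i=y]\):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, y = 5, 2                          # toy setup: 5 classes, ground-truth index 2
z = torch.randn(C, requires_grad=True)

p = F.softmax(z, dim=0)              # Eq. 1
loss = -torch.log(p[y])              # Eq. 3: L_CE = -log p_y
loss.backward()

expected = p.detach().clone()        # Eq. 4: dL_CE/dz_i = p_i - 1 if i = y, else p_i
expected[y] -= 1.0
print(torch.allclose(z.grad, expected, atol=1e-6))   # True
```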

3.2 Maximum Entropy Regularization

What really matters on the training set is the discriminative region rather than the noisy background, which leaves more uncertainty when depicting an image of a class. In other words, the prediction should have a larger entropy.

We would like to regularize the entropy of the output probability distribution \(\left\{ p_i \right\} _{i=1}^\mathrm {C}\) to make the model more general and alleviate overfitting. The entropy is formulated as:

$$\begin{aligned} H\left( \mathbf{p} \right) =-\sum \limits _{i=1}^{\mathrm {C}}p_i \log p_i \end{aligned}$$
(5)

Mathematically, the entropy \(H\left( \mathbf{p} \right) \) reaches its minimum when p is a one-hot vector, and its maximum when p is the uniform distribution. The former is realized automatically by the vanilla cross-entropy loss, while the latter is promising for regularization. Hence we take the negative entropy as the Maximum Entropy Regularization (MER) term, which is directly added to the common cross-entropy loss function:

$$\begin{aligned} L_{\mathrm {MER}}=-H\left( \mathbf{p} \right) \end{aligned}$$
(6)
$$\begin{aligned} L_{\mathrm {REG}}=L_{\mathrm {CE}}+\lambda L_{\mathrm {MER}} \end{aligned}$$
(7)

where \(\lambda \) is the hyperparameter deciding the influence of MER. Intuitively, MER reduces the extreme confidence caused by the cross-entropy loss. \(L_{\mathrm {CE}}\) and \(L_{\mathrm {MER}}\) act like two forces that push the probability of the positive class in opposite directions, as shown in Fig. 2. The subsequent part illustrates this from the perspective of the gradient.

Fig. 2.

Illustration of how the MER term influences the convergence probability of the positive class. The cross-entropy loss without MER always pushes the convergence probability towards 1.0, whereas the MER term pushes the probabilities towards the uniform distribution. They behave like two kinds of forces, and the probability finally reaches a point where the two forces are balanced.
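For concreteness, a minimal PyTorch sketch of the regularized loss in Eq. 7 might look as follows (the function name and the default value of \(\lambda \) are our own choices, not part of the original implementation):

```python
import torch
import torch.nn.functional as F

def mer_regularized_loss(logits: torch.Tensor, targets: torch.Tensor,
                         lam: float = 0.5) -> torch.Tensor:
    """L_REG = L_CE + lam * L_MER, with L_MER = -H(p) (Eqs. 5-7)."""
    log_p = F.log_softmax(logits, dim=-1)                    # log p_i, numerically stable
    ce = F.nll_loss(log_p, targets)                          # L_CE = -log p_y, batch mean
    neg_entropy = (log_p.exp() * log_p).sum(dim=-1).mean()   # -H(p), averaged over the batch
    return ce + lam * neg_entropy

# usage: logits of shape (batch, C), integer class targets of shape (batch,)
loss = mer_regularized_loss(torch.randn(4, 3665), torch.randint(0, 3665, (4,)), lam=0.5)
```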

3.3 Derivative of Regularized Loss

We now consider the derivative of the regularized loss with respect to the output scores \(\left\{ z_i \right\} _{i=1}^\mathrm {C}\), which are directly related to the model, as in Eq. 4.

The derivative with respect to the probability distribution is:

$$\begin{aligned} \frac{\partial L_{\mathrm {REG}}}{\partial p_i} = {\left\{ \begin{array}{ll} -\frac{1}{p_i}+\lambda \left( \log p_i +1\right) , &{} i=y\\ \lambda \left( \log p_i + 1\right) , &{} i\ne y \end{array}\right. } \end{aligned}$$
(8)

According to Eq. 8, Eq. 2 and the chain rule, the derivative with respect to the scores is:

$$\begin{aligned} \frac{\partial L_{\mathrm {REG}}}{\partial z_i}= {\left\{ \begin{array}{ll} p_i \left( 1-\frac{1}{p_i}+\lambda \log p_i - \lambda \sum \limits _j p_j \log p_j\right) , &{} i=y\\ p_i \left( 1 + \lambda \log p_i - \lambda \sum \limits _j p_j \log p_j\right) , &{} i\ne y \end{array}\right. } \end{aligned}$$
(9)

By defining a cell function f:

$$\begin{aligned} f(q)=q\left( 1 + \lambda \log q - \lambda \sum \limits _j p_j \log p_j\right) \end{aligned}$$
(10)

we reformulate Eq. 9 as:

$$\begin{aligned} \frac{\partial L_{\mathrm {REG}}}{\partial z_i}= {\left\{ \begin{array}{ll} f(p_i)-1, &{} i=y\\ f(p_i), &{} i\ne y \end{array}\right. } \end{aligned}$$
(11)

Note that Eq. 11 has a similarly elegant form to Eq. 4. The difference is that, with \(p_i\in [0, 1]\), the gradient in Eq. 11 is not always positive or negative, so the probabilities do not necessarily decrease to 0 or increase to 1, and the scores remain more distributed.
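The closed form in Eq. 11 can be verified numerically; the toy sketch below (class count, \(\lambda \) and seed chosen arbitrarily) compares it against PyTorch autograd:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, y, lam = 5, 2, 0.5
z = torch.randn(C, requires_grad=True)

p = F.softmax(z, dim=0)
loss = -torch.log(p[y]) + lam * (p * p.log()).sum()          # L_REG (Eq. 7)
loss.backward()

with torch.no_grad():
    f = p * (1 + lam * p.log() - lam * (p * p.log()).sum())  # cell function f (Eq. 10)
    expected = f.clone()
    expected[y] -= 1.0                                        # Eq. 11
print(torch.allclose(z.grad, expected, atol=1e-6))            # True
```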

3.4 Convergence Probability Distribution

Here we derive the theoretical convergence probability distribution under a fairly strong assumption. To simplify the problem, we only consider the probabilities instead of the scores, i.e., the softmax operation is ignored:

$$\begin{aligned} \text {min } L_{\mathrm {REG}}&=-\log p_y+\lambda \sum \limits _{i=1}^{\mathrm {C}}p_i \log p_i\\ \nonumber \text {s.t. } \sum \limits _{i=1}^{\mathrm {C}}p_i&=1 \end{aligned}$$
(12)
Fig. 3.

The \(p_y-\lambda \) curve. The convergence probability of the positive class decreases as \(\lambda \) increases. The more classes there are, the lower the probability the model converges to.

A natural way to solve it is to convert it into an unconstrained optimization problem with a Lagrange multiplier:

$$\begin{aligned} \text {min } L_{\mathrm {lag}}=-\log p_y+\lambda \sum \limits _{i=1}^{\mathrm {C}}p_i \log p_i + \alpha \left( \sum \limits _{i=1}^{\mathrm {C}}p_i-1\right) \end{aligned}$$
(13)

We assume that every negative class plays the same role, i.e., all negative classes share the same probability. Setting the derivatives with respect to \(p_i\) and \(\alpha \) to 0, we obtain the relation between the convergence probabilities of the positive and negative classes:

$$\begin{aligned} p_i=p_y e^{-\frac{1}{\lambda p_y}}, i\ne y \end{aligned}$$
(14)

Furthermore, the relationship between the convergence probability and the coefficient \(\lambda \) follows from Eq. 14 and the constraint in Eq. 12:

$$\begin{aligned} \lambda = \frac{m}{\log \frac{\mathrm {C}-1}{m-1}} \end{aligned}$$
(15)

where \(m=\frac{1}{p_y}>1\). Equations 14 and 15 together describe the ideal output probability distribution at convergence for a given \(\lambda \). To be more intuitive, we plot the \(p_y-\lambda \) curve in Fig. 3. The convergence probability of the positive class decreases monotonically as \(\lambda \) increases. When \(\lambda \rightarrow +\infty \), \(p_y \rightarrow \frac{1}{\mathrm {C}}\), which is exactly the uniform distribution.
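The relationship in Eq. 15 is easy to evaluate numerically. The sketch below (plain Python; the helper names are ours) computes \(\lambda \) from a target \(p_y\) and, since the relation is monotonic, inverts it by bisection to recover \(p_y\) from a given \(\lambda \):

```python
import math

def lam_from_py(p_y: float, C: int) -> float:
    """Eq. 15 with m = 1 / p_y (requires 1/C < p_y < 1)."""
    m = 1.0 / p_y
    return m / math.log((C - 1) / (m - 1))

def py_from_lam(lam: float, C: int, tol: float = 1e-10) -> float:
    """Invert Eq. 15 by bisection; p_y decreases monotonically as lambda grows."""
    lo, hi = 1.0 / C + 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lam_from_py(mid, C) > lam:   # lambda too large -> p_y must be larger
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# e.g. with C = 3665 classes (CTW character recognition)
print(py_from_lam(0.5, 3665))
```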

Under this assumption, MER is no different from label smoothing [27]. Both leverage a uniform distribution for regularization, which is unrealistic over all classes. The \(p_y-\lambda \) curve bridges the two methods. It also gives a deeper understanding of \(\lambda \) and some guidance on how to choose a proper value.

In fact, MER is stronger than label smoothing, since it has more potential beyond this naive assumption. Previous experiments [2, 19] as well as ours illustrate this.

4 Experiments

We conduct several experiments on both Chinese text recognition and fine-grained image classification using PyTorch [18] to demonstrate the effectiveness of MER. Besides, we verify the \(p_y-\lambda \) curve, compare MER with label smoothing, and investigate its effectiveness when the model is trained with label corruption [3].

4.1 Chinese Text Recognition

We conduct Chinese character recognition and Chinese text line recognition respectively.

Datasets. CTW is a very large dataset of Chinese text in street-view images [32]. It contains 32,285 images with 1,018,402 Chinese characters from 3,850 unique classes. Annotations are provided only at the character level. The text appears in various styles, including planar text, raised text, text under poor illumination, distant text and partially occluded text.

ReCTS consists of 25,000 scene-text images captured from signboards [6]. All text lines and characters are labeled. It has 440,027 Chinese characters from 4,435 unique classes and 108,924 Chinese text lines covering 4,135 unique characters.

Implementation Details. For Chinese character recognition, we treat the task as a classification problem and use SE-ResNet50 [9] as the backbone. Images are resized to \(32\times 32\). The training batch size is 128. On CTW, specifically, we train on both the training and validation sets with 3,665 character classes instead of only the 1,000 common characters [33]. For Chinese text line recognition, we use the same attention-based Encoder-Decoder as ASTER [24] but without the STN module. Images are resized to \(32\times 128\) and the batch size is 64.

Data augmentation is used, including random rotation in the range [−10°, 10°], perspective transformation, and random changes of brightness, contrast and saturation. We first train a base model from scratch, then finetune the pretrained model with and without MER using the same training strategy for a fair comparison. Stochastic gradient descent (SGD) with momentum is used for optimization, and the learning rate is decayed (from 1e−2 to 1e−5 with a decay rate of 0.1) whenever the training loss stops falling for a while. We set the weight decay to 1e−4 and the momentum to 0.9. All models are trained on one NVIDIA 1080Ti graphics card with 11 GB memory. Convergence takes less than 12 h for character recognition and about 2 days for text line recognition.
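A rough sketch of this setup is given below; the perspective-distortion scale, the color-jitter strengths, the scheduler choice and the placeholder model are assumptions not specified in the text:

```python
import torch
import torch.nn as nn
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=10),                       # angles in [-10°, 10°]
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),   # strength assumed
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),  # strengths assumed
    transforms.Resize((32, 32)),                                 # 32x128 for text lines
    transforms.ToTensor(),
])

model = nn.Linear(3 * 32 * 32, 3665)   # placeholder; the paper uses SE-ResNet50 / an ASTER-style decoder
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, min_lr=1e-5)              # decay when the training loss plateaus
```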

Table 1. Accuracy of Chinese text recognition

Results. Training with MER improves accuracy without any additional overhead, so a model can be strengthened easily. As shown in Table 1, the model trained with MER outperforms the one without MER on both character recognition and text line recognition. MER can even make a model outperform a deeper one, as with SE-ResNet50 and ResNet152 in Table 1(a).

To be more intuitive, we visualize the region response of character images in CTW by summing over the intermediate feature map channels and applying min-max normalization. As shown in the top two rows of Fig. 4, the model trained without MER usually has a scattered response and is prone to focusing on noisy regions. With MER, the model concentrates mainly on the text body and is thus more robust to noisy backgrounds.
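The response maps in the top rows of Fig. 4 can be produced roughly as follows (a sketch; which intermediate layer is used and how the map is upsampled to the input size are not specified in the text):

```python
import torch

def response_map(feature_map: torch.Tensor) -> torch.Tensor:
    """Sum an intermediate feature map over its channels and min-max normalize it."""
    # feature_map: (C, H, W) taken from an intermediate convolutional layer
    resp = feature_map.sum(dim=0)                                   # (H, W)
    resp = (resp - resp.min()) / (resp.max() - resp.min() + 1e-8)   # min-max normalization
    return resp
```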

4.2 Fine-Grained Image Classification

Text recognition can be regarded as a kind of fine-grained image classification, since many characters have subtle inter-class but large intra-class differences. To be more general, we also verify the effectiveness of MER on the classical fine-grained dataset CUB-200-2011.

CUB-200-2011 contains 11,788 images of 200 bird species, 5,994 for training and 5,794 for testing [28]. It is a typical and popular dataset for fine-grained image classification.

Implementation Details. For the training data, we adopt random cropping and random horizontal flipping for augmentation. The images are then resized to \(448\times 448\). ResNet50 [7] is the backbone network, initialized from a model pretrained on ImageNet. We train the model for 80 epochs with a batch size of 8 using the momentum SGD optimizer. The learning rate starts from 1e−3 and decays by a factor of 0.3 at epochs 40, 60 and 70. When we use MER, \(\lambda \) is set empirically to 1.0, 0.5, 0.2 and 0.1, switching at epochs 30, 50 and 70 respectively.
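One plausible reading of this \(\lambda \) schedule (our interpretation, not the authors' released code) is:

```python
def mer_lambda_schedule(epoch: int) -> float:
    """Empirical lambda schedule for CUB-200-2011: start at 1.0 and
    lower it to 0.5, 0.2 and 0.1 at epochs 30, 50 and 70."""
    if epoch < 30:
        return 1.0
    if epoch < 50:
        return 0.5
    if epoch < 70:
        return 0.2
    return 0.1
```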

Fig. 4.

Visualization of activation maps. Rows 1–2 show Chinese characters, and rows 3–4 show fine-grained birds. Every triplet contains the input image and the attention maps of the models trained with and without MER. Training without MER tends to make the model focus on broader regions, including much background noise, whereas MER regularizes the model to focus on more compact and discriminative regions. Note that rows 1–2 are visualized by summing over intermediate feature map channels, whereas rows 3–4 show class activation maps.

Results. Our simple ResNet50 trained with MER also gains a lot and even outperforms previous, more complicated models, as shown in Table 2. Hence our model is both accurate and fast.

We visualize the activation maps with CAM [40], using the last convolutional feature map and the weights of the last linear layer. As shown in the bottom two rows of Fig. 4, MER makes the model focus on more compact and discriminative regions and ignore the noisy background, which is harmful to generalization. In some circumstances, the appearance of the background or of common (non-discriminative) body regions can make a model more confident on the training set, but then the truly discriminative regions do not receive enough attention.
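A minimal sketch of the CAM [40] computation used for rows 3–4 of Fig. 4 (the function name is ours; upsampling to the input resolution is omitted):

```python
import torch

def class_activation_map(features: torch.Tensor, fc_weight: torch.Tensor,
                         cls: int) -> torch.Tensor:
    """Weight the last convolutional feature map by the linear-layer weights of class `cls`."""
    # features: (C, H, W) last conv feature map; fc_weight: (num_classes, C)
    cam = torch.einsum('c,chw->hw', fc_weight[cls], features)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # min-max normalize for display
    return cam
```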

Table 2. Accuracy on CUB-200-2011
Table 3. Theoretical and experimental convergence probability of positive class (CPP) with different \(\lambda \) on CTW character recognition.

4.3 Verification of \(p_y-\lambda \) Curve

We experimentally verify the theoretical convergence probability distribution described by Eq. 15 and Fig. 3.

Training with MER on CTW character recognition, we fix \(\lambda \) for each experiment. When the model converges, the value of the cross-entropy loss on the training set is used to calculate the expected experimental convergence probability of the positive class (CPP):

$$\begin{aligned} p_y=e^{-L_{\mathrm {CE}}} \end{aligned}$$
(16)

The theoretical convergence probability can be read directly from the curve in Fig. 3. As shown in Table 3, the experimental value is always slightly lower than the theoretical one. This is normal, since no model fits a complicated distribution perfectly. Moreover, the curve itself is only an approximation under the assumption that every negative class plays the same role. The two are therefore in accordance. As a result, the theoretical curve can be used to roughly estimate the convergence probability distribution on the training set, which gives \(\lambda \) a practical meaning and may guide us in choosing a proper value.
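For reference, the experimental CPP of Eq. 16 is a one-line computation; the loss value used here is illustrative, not a number from Table 3:

```python
import math

def experimental_cpp(mean_train_ce: float) -> float:
    """Eq. 16: recover the average positive-class probability from the converged training CE loss."""
    return math.exp(-mean_train_ce)

print(round(experimental_cpp(0.7), 2))   # a training CE loss of 0.7 corresponds to a CPP of about 0.50
```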

4.4 Comparison with Label Smoothing

Since MER is very similar to label smoothing (LS), we compare their results on CUB-200-2011. LS also has a coefficient \(\lambda \), which means the final convergence probability of the positive class (CPP) is \(\left( 1-\lambda +\frac{\lambda }{\mathrm {C}}\right) \). Hence for both methods the CPP decreases as \(\lambda \) increases.

Inspired by the theoretical relationship between CPP (\(p_y\)) and \(\lambda \), we choose 4 theoretical CPPs to form 4 pairs of experiments. For each CPP, the \(\lambda \) of LS is \(1-p_y\), and the \(\lambda \) of MER is chosen from the curve in Fig. 3 as in the previous part.
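The pairing of the two coefficients can be reproduced with the relations above. The sketch below uses \(\lambda _{\mathrm {LS}}=1-p_y\) as stated (which neglects the small \(\frac{\lambda }{\mathrm {C}}\) term) and Eq. 15 for MER; the listed target CPPs are illustrative, as only CPP = 24% is mentioned in the text:

```python
import math

def ls_lambda(target_cpp: float) -> float:
    """Label smoothing coefficient as used above: lambda = 1 - p_y."""
    return 1.0 - target_cpp

def mer_lambda(target_cpp: float, C: int) -> float:
    """MER coefficient from Eq. 15 with m = 1 / p_y."""
    m = 1.0 / target_cpp
    return m / math.log((C - 1) / (m - 1))

C = 200   # CUB-200-2011
for cpp in (0.90, 0.70, 0.50, 0.24):       # only 0.24 is cited in the text
    print(cpp, round(ls_lambda(cpp), 3), round(mer_lambda(cpp, C), 3))
```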

Table 4. Comparison of MER and label smoothing on CUB-200-2011
Table 5. Results of model trained with label corruption. Note that \(\lambda = 0.0\) means the model is trained without MER.

As shown in Table 4, MER always brings some improvement, while LS is more sensitive to \(\lambda \) and can even be harmful (see CPP = 24). MER achieves better accuracy than LS with their respective \(\lambda \)s. Besides, LS only works well when \(\lambda \) is small. We argue that the negative influence of the dependence on the uniform distribution is magnified and non-negligible with a large \(\lambda \) (a small CPP), which limits the potential of LS. By contrast, MER is more flexible in regularizing towards the expected probability distribution. For the same theoretical CPP, the training entropy of LS is higher than that of MER, which also reflects the influence of the uniform distribution.

4.5 Train with Label Corruption

To explore the power of MER on noisy datasets, we randomly corrupt the labels of a certain proportion of the training images. Label corruption is usually more harmful than feature corruption [3], since false labels mislead the learning process.

We find that the model trained with MER is more robust on both Chinese character recognition and fine-grained classification. In Table 5, the corruption rate is the proportion of training images whose labels are randomly corrupted. The model trained without MER (\(\lambda =0.0\)) suffers greatly because it is always very confident and devotes itself blindly to the given labels, even when some of them are wrong. MER regularizes the model to be less confident of the training labels, so the learning process is less disturbed.
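The corruption protocol can be sketched as follows (a hypothetical helper; whether a corrupted label may coincide with the original one is not specified in the text):

```python
import random

def corrupt_labels(labels, num_classes: int, corruption_rate: float, seed: int = 0):
    """Randomly re-assign the labels of a given proportion of training samples."""
    rng = random.Random(seed)
    corrupted = list(labels)
    n_corrupt = int(round(corruption_rate * len(labels)))
    for idx in rng.sample(range(len(labels)), n_corrupt):
        corrupted[idx] = rng.randrange(num_classes)   # may happen to equal the true label
    return corrupted
```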

5 Conclusion

In this paper, we present an in-depth analysis of Maximum Entropy Regularization, including how the MER term influences the convergence probability distribution and the learning process. MER improves the generalization and robustness of a model without any additional parameters. We employ MER on both Chinese text recognition and generic fine-grained classification to alleviate overfitting, and obtain consistent improvement. We hope that our theoretical analysis will be useful for further study of this regularization.