
1 Introduction

Text recognition is a popular topic in the deep learning community. Most existing deep learning-based works [13, 15, 20, 22, 23, 24] focus on Latin script and achieve good performance.

However, Chinese text differs greatly from Latin text. There are thousands of common Chinese characters, appearing in various styles. Chinese text recognition can be regarded as a kind of fine-grained image classification due to the high inter-class similarity and the large intra-class variance, as shown in Fig. 1(a). Besides, there is usually severe data imbalance over character classes [32, 33]. These characteristics create a large demand for training data, so character-based recognition models are prone to overfitting. Radical-based methods [16, 29, 30, 35] decompose Chinese characters into radicals as the basic classes, which simplifies the character structure and decreases the number of classes, thus reducing the demand for training data. We mention them here to underline how much data character-based recognition requires. Nevertheless, radical-based methods are not flexible enough for various circumstances, e.g., handwritten text [37], and current methods of this kind are time-consuming in practice due to their RNN-based decoders.

Fig. 1.

Examples of Chinese characters and two kinds of model prediction. (a) The two images in each column belong to the same character class, while the three images in each row belong to different classes, illustrating the high inter-class similarity and the large intra-class variance that characterize a fine-grained problem. (b) Although the two probability distributions yield the same prediction on the training set, the left one has a higher entropy and describes the learned features better. One negative character is clearly far more similar to the ground truth than the other, so the confidences assigned to them should differ substantially.

A common practice to deal with overfitting is to regularize the model during training. Several techniques aim at this, including dropout [25], L2 regularization and batch normalization [10]. They act on model parameters or hidden units like a black box. We instead consider regularization from the perspective of entropy. The predicted probability distribution is an indication of how the network generalizes [19]. In Chinese character recognition, we hope that, given an input image, a similar negative class is assigned a larger probability than a dissimilar one. This requires the probability of the positive class not to be too large, leaving probability mass for the negative classes, which results in a higher entropy. This is acceptable because the prediction is still correct as long as the probability of the positive class is the largest. The maximum entropy principle [12] also states that the model whose output distribution has the largest entropy represents the features best. Figure 1(b) illustrates the idea by comparing two probability distributions with high and low entropy respectively. Hence we adopt Maximum Entropy Regularization [19] to regularize the training process.

In this paper, we perform an in-depth theoretical analysis of Maximum Entropy Regularization. The cross-entropy loss and the negative entropy term behave like two one-dimensional forces acting on the output probability distribution. The elegant gradient function illustrates the regularized behaviour from the perspective of backward propagation. Under an assumption similar to that of label smoothing, we formulate the convergence probability distribution on the training set, which is exactly a relationship between the convergence probability of the positive class and the coefficient \(\lambda \).

We conduct experiments on Chinese character recognition, Chinese text line recognition and fine-grained image classification, and obtain consistent improvement. In addition, we find that a model trained with MER attends to more compact and discriminative regions and filters out much of the noisy area. MER also makes the model more robust when label corruption is applied to the training data.

2 Related Works

Our work focuses on model regularization and its application to Chinese text recognition. Here we briefly review some recent works about these two aspects.

2.1 Model Regularization

Large deep neural networks are prone to overfitting in many cases. To relieve the problem, model regularization is commonly used during training. Dropout [25] randomly drops neurons or connections with a certain probability in layers. L2 regularization, also called weight decay, restricts the magnitude of model weights. Batch normalization [10] normalizes hidden units within a training batch to reduce internal covariate shift. These methods act on model parameters or hidden layers, which is hard to control and not intuitive. Mixup [34] applies a linear operation to both input images and their labels under the assumption that linear interpolations of features should lead to linear interpolations of the associated targets.

Recently, the output distribution of neural networks has received much attention. Knowledge distillation [8] trains a small model to have an output distribution similar to that of a large model, since the “soft targets” transfer the generalization ability, which indicates the effect of the output distribution. Label smoothing [27] encourages the prediction to be less confident by perturbing the one-hot ground-truth label with a uniform distribution; in terms of the loss function, this adds a KL-divergence term (between the uniform and the output distribution) to the cross-entropy loss. Label smoothing improves generalization and model calibration, and benefits distillation [17]. Softermax loss [1] encourages the sum of the top-k output probabilities to be as large as possible to alleviate the extreme confidence caused by the cross-entropy loss, but the hyperparameter k is hard to choose. Bootstrapping [21] combines the ground-truth distribution and the output distribution into the target distribution, so that the model can be trained well on datasets with noisy labels [3]; however, the output entropy remains as low as that obtained with the cross-entropy loss. According to the maximum entropy principle [12], the model whose probability distribution best represents the current state of knowledge is the one with the largest entropy. Correspondingly, a maximum-entropy based method called confidence penalty [19] adds a negative entropy term to the loss function, which acts similarly to label smoothing but performs better. As a result, it has attracted researchers' interest in applying it to multiple tasks, e.g., sequence modeling [14], named entity recognition [31] and fine-grained image classification [2]. CTC [5]-based sequence models penalize peaky distributions by using maximum-entropy based regularization [14]. However, none of these works analyzes the regularization term or the whole loss function in depth, either theoretically or experimentally.

2.2 Chinese Text Recognition

Chinese text recognition is more challenging than Latin script recognition due to the larger number of character categories and the more complicated layouts. Two tasks are usually derived from it: Chinese character recognition (CCR) and Chinese text line recognition (CLR).

As for Chinese characters, recent methods can be divided into two streams: character-based CCR (CCCR) [39] and radical-based CCR (RCCR) [35]. When each Chinese character is taken as a single class, it can be classified well with deep learning [36]. However, CCCR cannot handle unseen characters. By considering the structure of Chinese characters, RCCR methods exploit radicals to represent a character [16, 29, 30, 35]. Multi-label learning has been used to detect radicals [29]. Radical analysis network (RAN) [35] treats the spatial structure of a single Chinese character as a radical sequence and decodes it with an attention-based RNN. JSRAN [30] improves RAN by jointly using STN [11] and RAN. However, RAN is very time-consuming during inference, and RCCR struggles with some nonstandard handwritten text. In this paper, we choose CCCR and use SE-ResNet-50 [9] as the backbone for CCR.

As for Chinese text lines, there are currently also two streams: one is the Convolutional Recurrent Neural Network (CRNN) [23] with CTC [5], and the other is the attention-based Encoder-Decoder [24]. Their vanilla versions can only process images one-dimensionally. 2D-attention [13, 15] decodes an encoded text image from a two-dimensional perspective, which is an extension of the latter. In our experiments, we simply use a 1-D attention-based Encoder-Decoder for CLR.

3 Analysis on Maximum Entropy Regularization

Unlike previous works, which only analyze the entropy term, we study both the term and the complete loss function to discover their joint effect.

3.1 Review of Cross-Entropy Loss

We first review the common setting of a classification problem. Given an input sample x with label y, a classification model produces \(\mathrm {C}\) scores \(\left\{ z_i \right\} _{i=1}^\mathrm {C}\). Then we obtain the output probability distribution p via the softmax function:

$$\begin{aligned} p_i=\frac{e^{z_i}}{\sum \limits _j e^{z_j}} \end{aligned}$$
(1)

The derivative of softmax is:

$$\begin{aligned} \frac{\partial p_i}{\partial z_j} = {\left\{ \begin{array}{ll} p_j \left( 1-p_j \right) , &{} i=j\\ -p_i p_j, &{} i\ne j \end{array}\right. } \end{aligned}$$
(2)

The cross-entropy (CE) loss and its derivative are:

$$\begin{aligned} L_{\mathrm {CE}}=-\log p_y \end{aligned}$$
(3)
$$\begin{aligned} \frac{\partial L_{\mathrm {CE}}}{\partial z_i} = {\left\{ \begin{array}{ll} p_i-1<0, &{} i=y\\ p_i>0, &{} i\ne y \end{array}\right. } \end{aligned}$$
(4)

Optimized by gradient descent, the model pushes the probability of the y-th class higher and higher and those of the other classes lower and lower, which leads to an approximation of the one-hot vector on the training set. Consequently we get confident output with low entropy, which is often a symptom of overfitting [27]. In image classification, a confident model tends to focus on many regions, even including noisy background, in order to gather sufficient clues for an assertive prediction.
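As a quick sanity check of Eqs. 1–4, the following PyTorch snippet (a minimal sketch; the class count, seed and tolerance are arbitrary choices of ours) compares the autograd gradient of the cross-entropy loss with the closed form \(p_i - \mathbb {1}[i=y]\):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, y = 5, 2                          # toy setup: 5 classes, ground-truth index 2
z = torch.randn(C, requires_grad=True)

p = F.softmax(z, dim=0)              # Eq. 1
loss = -torch.log(p[y])              # Eq. 3: L_CE = -log p_y
loss.backward()

expected = p.detach().clone()        # Eq. 4: dL_CE/dz_i = p_i - 1 if i = y, else p_i
expected[y] -= 1.0
print(torch.allclose(z.grad, expected, atol=1e-6))   # True
```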

3.2 Maximum Entropy Regularization

What really matters on the training set is the discriminative region rather than the noisy background, which leaves more uncertainty when depicting an image of a class. In other words, the prediction should have a larger entropy.

We would like to regularize the entropy of the output probability distribution \(\left\{ p_i \right\} _{i=1}^\mathrm {C}\) to make the model more general and alleviate overfitting. The entropy is formulated as:

$$\begin{aligned} H\left( \mathbf{p} \right) =-\sum \limits _{i=1}^{\mathrm {C}}p_i \log p_i \end{aligned}$$
(5)

Mathematically, the entropy \(H\left( \mathbf{p} \right) \) reaches its minimum when p is a one-hot vector, and its maximum when p is the uniform distribution. The former is realized automatically by the vanilla cross-entropy loss, while the latter is promising for regularization. Hence we take the negative entropy as the Maximum Entropy Regularization (MER) term, which is directly added to the common cross-entropy loss function:

$$\begin{aligned} L_{\mathrm {MER}}=-H\left( \mathbf{p} \right) \end{aligned}$$
(6)
$$\begin{aligned} L_{\mathrm {REG}}=L_{\mathrm {CE}}+\lambda L_{\mathrm {MER}} \end{aligned}$$
(7)

where \(\lambda \) is the hyperparameter deciding the influence of MER. Intuitively, MER reduces the extreme confidence caused by the cross-entropy loss. \(L_{\mathrm {CE}}\) and \(L_{\mathrm {MER}}\) act like two forces that push the probability of the positive class in opposite directions, as shown in Fig. 2. The subsequent part illustrates this from the perspective of the gradient.

Fig. 2.

Illustration of how the MER term influences the convergence probability of the positive class. The cross-entropy loss without MER always pushes the convergence probability towards 1.0, whereas the MER term pushes the probabilities towards the uniform distribution. They behave like two kinds of forces, and the probability finally reaches a point where the two forces are balanced.
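For concreteness, a minimal PyTorch sketch of the regularized loss in Eq. 7 might look as follows (the function name and the default value of \(\lambda \) are our own choices, not part of the original implementation):

```python
import torch
import torch.nn.functional as F

def mer_regularized_loss(logits: torch.Tensor, targets: torch.Tensor,
                         lam: float = 0.5) -> torch.Tensor:
    """L_REG = L_CE + lam * L_MER, with L_MER = -H(p) (Eqs. 5-7)."""
    log_p = F.log_softmax(logits, dim=-1)                    # log p_i, numerically stable
    ce = F.nll_loss(log_p, targets)                          # L_CE = -log p_y, batch mean
    neg_entropy = (log_p.exp() * log_p).sum(dim=-1).mean()   # -H(p), averaged over the batch
    return ce + lam * neg_entropy

# usage: logits of shape (batch, C), integer class targets of shape (batch,)
loss = mer_regularized_loss(torch.randn(4, 3665), torch.randint(0, 3665, (4,)), lam=0.5)
```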

3.3 Derivative of Regularized Loss

We now consider the derivative of the regularized loss with respect to the output scores \(\left\{ z_i \right\} _{i=1}^\mathrm {C}\), which are directly related to the model, as in Eq. 4.

The derivative with respect to the probability distribution is:

$$\begin{aligned} \frac{\partial L_{\mathrm {REG}}}{\partial p_i} = {\left\{ \begin{array}{ll} -\frac{1}{p_i}+\lambda \left( \log p_i +1\right) , &{} i=y\\ \lambda \left( \log p_i + 1\right) , &{} i\ne y \end{array}\right. } \end{aligned}$$
(8)

According to Eq. 8, Eq. 2 and the chain rule, the derivative with respect to the scores is:

$$\begin{aligned} \frac{\partial L_{\mathrm {REG}}}{\partial z_i}= {\left\{ \begin{array}{ll} p_i \left( 1-\frac{1}{p_i}+\lambda \log p_i - \lambda \sum \limits _j p_j \log p_j\right) , &{} i=y\\ p_i \left( 1 + \lambda \log p_i - \lambda \sum \limits _j p_j \log p_j\right) , &{} i\ne y \end{array}\right. } \end{aligned}$$
(9)

By defining a cell function f:

$$\begin{aligned} f(q)=q\left( 1 + \lambda \log q - \lambda \sum \limits _j p_j \log p_j\right) \end{aligned}$$
(10)

we reformulate Eq. 9 as:

$$\begin{aligned} \frac{\partial L_{\mathrm {REG}}}{\partial z_i}= {\left\{ \begin{array}{ll} f(p_i)-1, &{} i=y\\ f(p_i), &{} i\ne y \end{array}\right. } \end{aligned}$$
(11)

Note that Eq. 11 has a similarly elegant form to Eq. 4. The difference is that, with \(p_i\in [0, 1]\), the gradient in Eq. 11 is not always positive or negative, so the probabilities do not necessarily decrease to 0 or increase to 1, and the scores remain more distributed.
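The closed form in Eq. 11 can be verified numerically; the toy sketch below (class count, \(\lambda \) and seed chosen arbitrarily) compares it against PyTorch autograd:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, y, lam = 5, 2, 0.5
z = torch.randn(C, requires_grad=True)

p = F.softmax(z, dim=0)
loss = -torch.log(p[y]) + lam * (p * p.log()).sum()          # L_REG (Eq. 7)
loss.backward()

with torch.no_grad():
    f = p * (1 + lam * p.log() - lam * (p * p.log()).sum())  # cell function f (Eq. 10)
    expected = f.clone()
    expected[y] -= 1.0                                        # Eq. 11
print(torch.allclose(z.grad, expected, atol=1e-6))            # True
```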

3.4 Convergence Probability Distribution

Here we derive the theoretical convergence probability distribution under a fairly strong assumption. To simplify the problem, we only consider the probabilities instead of the scores, i.e., the softmax operation is ignored:

$$\begin{aligned} \text {min } L_{\mathrm {REG}}&=-\log p_y+\lambda \sum \limits _{i=1}^{\mathrm {C}}p_i \log p_i\\ \nonumber \text {s.t. } \sum \limits _{i=1}^{\mathrm {C}}p_i&=1 \end{aligned}$$
(12)
Fig. 3.

The \(p_y-\lambda \) curve. The convergence probability of the positive class decreases as \(\lambda \) increases. The more classes there are, the lower the probability the model converges to.

A natural way to solve it is to convert it into an unconstrained optimization problem with a Lagrange multiplier:

$$\begin{aligned} \text {min } L_{\mathrm {lag}}=-\log p_y+\lambda \sum \limits _{i=1}^{\mathrm {C}}p_i \log p_i + \alpha \left( \sum \limits _{i=1}^{\mathrm {C}}p_i-1\right) \end{aligned}$$
(13)

We assume that every negative class plays the same role, i.e., all negative classes share the same probability. Setting the derivatives with respect to \(p_i\) and \(\alpha \) to 0, we obtain the relation between the convergence probabilities of the positive and negative classes:

$$\begin{aligned} p_i=p_y e^{-\frac{1}{\lambda p_y}}, i\ne y \end{aligned}$$
(14)

Furthermore, the relationship between the convergence probability and the coefficient \(\lambda \) follows from Eq. 14 and the constraint in Eq. 12:

$$\begin{aligned} \lambda = \frac{m}{\log \frac{\mathrm {C}-1}{m-1}} \end{aligned}$$
(15)

where \(m=\frac{1}{p_y}>1\). Equations 14 and 15 together describe the ideal output probability distribution at convergence for a given \(\lambda \). To be more intuitive, we plot the \(p_y-\lambda \) curve in Fig. 3. The convergence probability of the positive class decreases monotonically as \(\lambda \) increases. When \(\lambda \rightarrow +\infty \), \(p_y \rightarrow \frac{1}{\mathrm {C}}\), which is exactly the uniform distribution.
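The relationship in Eq. 15 is easy to evaluate numerically. The sketch below (plain Python; the helper names are ours) computes \(\lambda \) from a target \(p_y\) and, since the relation is monotonic, inverts it by bisection to recover \(p_y\) from a given \(\lambda \):

```python
import math

def lam_from_py(p_y: float, C: int) -> float:
    """Eq. 15 with m = 1 / p_y (requires 1/C < p_y < 1)."""
    m = 1.0 / p_y
    return m / math.log((C - 1) / (m - 1))

def py_from_lam(lam: float, C: int, tol: float = 1e-10) -> float:
    """Invert Eq. 15 by bisection; p_y decreases monotonically as lambda grows."""
    lo, hi = 1.0 / C + 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lam_from_py(mid, C) > lam:   # lambda too large -> p_y must be larger
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# e.g. with C = 3665 classes (CTW character recognition)
print(py_from_lam(0.5, 3665))
```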

Under this assumption, MER is no different from label smoothing [27]. Both leverage a uniform distribution for regularization, which is unrealistic over all classes. The \(p_y-\lambda \) curve bridges the two methods. It also gives a deeper understanding of \(\lambda \) and some guidance on how to choose a proper value.

In fact, MER is stronger than label smoothing, since it has more potential beyond this naive assumption. Previous experiments [2, 19] as well as ours illustrate this.

4 Experiments

We conduct several experiments on both Chinese text recognition and fine-grained image classification using PyTorch [18] to demonstrate the effectiveness of MER. Besides, we verify the \(p_y-\lambda \) curve, compare MER with label smoothing, and investigate its effectiveness when the model is trained with label corruption [3].

4.1 Chinese Text Recognition

We conduct Chinese character recognition and Chinese text line recognition respectively.

Datasets. CTW is a very large dataset of Chinese text in street-view images [32]. It contains 32,285 images with 1,018,402 Chinese characters from 3,850 unique classes. Annotations are provided only at the character level. The text appears in various styles, including planar text, raised text, text under poor illumination, distant text and partially occluded text.

ReCTS consists of 25,000 scene-text images captured from signboards [6]. All text lines and characters are labeled. It has 440,027 Chinese characters from 4,435 unique classes and 108,924 Chinese text lines covering 4,135 unique characters.

Implementation Details. For Chinese character recognition, we treat the task as a classification problem and use SE-ResNet50 [9] as the backbone. Images are resized to \(32\times 32\). The training batch size is 128. On CTW, specifically, we train on both the training and validation sets with 3,665 character classes instead of only the 1,000 common characters [33]. For Chinese text line recognition, we use the same attention-based Encoder-Decoder as ASTER [24] but without the STN module. Images are resized to \(32\times 128\) and the batch size is 64.

Data augmentation is used, including random rotation in the range [−10°, 10°], perspective transformation, and random changes of brightness, contrast and saturation. We first train a base model from scratch, then finetune the pretrained model with and without MER using the same training strategy for a fair comparison. Stochastic gradient descent (SGD) with momentum is used for optimization, and the learning rate is decayed (from 1e−2 to 1e−5 with a decay rate of 0.1) whenever the training loss stops falling for a while. We set the weight decay to 1e−4 and the momentum to 0.9. All models are trained on one NVIDIA 1080Ti graphics card with 11 GB memory. Convergence takes less than 12 h for character recognition and about 2 days for text line recognition.
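A rough sketch of this setup is given below; the perspective-distortion scale, the color-jitter strengths, the scheduler choice and the placeholder model are assumptions not specified in the text:

```python
import torch
import torch.nn as nn
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=10),                       # angles in [-10°, 10°]
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),   # strength assumed
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),  # strengths assumed
    transforms.Resize((32, 32)),                                 # 32x128 for text lines
    transforms.ToTensor(),
])

model = nn.Linear(3 * 32 * 32, 3665)   # placeholder; the paper uses SE-ResNet50 / an ASTER-style decoder
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, min_lr=1e-5)              # decay when the training loss plateaus
```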

Table 1. Accuracy of Chinese text recognition

Results. Training with MER improves accuracy without any additional overhead, so a model can be strengthened easily. As shown in Table 1, the model trained with MER outperforms the one without MER on both character recognition and text line recognition. MER can even make a model outperform a deeper one, as with SE-ResNet50 and ResNet152 in Table 1(a).

To be more intuitive, we visualize the region response of character images in CTW by summing over the intermediate feature map channels and applying min-max normalization. As shown in the top two rows of Fig. 4, the model trained without MER usually has a scattered response and is prone to focusing on noisy regions. With MER, the model concentrates mainly on the text body and is thus more robust to noisy backgrounds.
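The response maps in the top rows of Fig. 4 can be produced roughly as follows (a sketch; which intermediate layer is used and how the map is upsampled to the input size are not specified in the text):

```python
import torch

def response_map(feature_map: torch.Tensor) -> torch.Tensor:
    """Sum an intermediate feature map over its channels and min-max normalize it."""
    # feature_map: (C, H, W) taken from an intermediate convolutional layer
    resp = feature_map.sum(dim=0)                                   # (H, W)
    resp = (resp - resp.min()) / (resp.max() - resp.min() + 1e-8)   # min-max normalization
    return resp
```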

4.2 Fine-Grained Image Classification

Text recognition can be regarded as a kind of fine-grained image classification, since many characters have subtle inter-class but large intra-class differences. To be more general, we also verify the effectiveness of MER on the classical fine-grained dataset CUB-200-2011.

CUB-200-2011 contains 11,788 images of 200 bird species, 5,994 for training and 5,794 for testing [28]. It is a typical and popular dataset for fine-grained image classification.

Implementation Details. For the training data, we adopt random cropping and random horizontal flipping for augmentation. The images are then resized to \(448\times 448\). ResNet50 [7] is the backbone network, initialized from a model pretrained on ImageNet. We train the model for 80 epochs with a batch size of 8 using the momentum SGD optimizer. The learning rate starts from 1e−3 and decays by a factor of 0.3 at epochs 40, 60 and 70. When we use MER, \(\lambda \) is set empirically to 1.0, 0.5, 0.2 and 0.1, switching at epochs 30, 50 and 70 respectively.
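One plausible reading of this \(\lambda \) schedule (our interpretation, not the authors' released code) is:

```python
def mer_lambda_schedule(epoch: int) -> float:
    """Empirical lambda schedule for CUB-200-2011: start at 1.0 and
    lower it to 0.5, 0.2 and 0.1 at epochs 30, 50 and 70."""
    if epoch < 30:
        return 1.0
    if epoch < 50:
        return 0.5
    if epoch < 70:
        return 0.2
    return 0.1
```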

Fig. 4.

Visualization of activation maps. Rows 1–2 show Chinese characters, and rows 3–4 show fine-grained birds. Every triplet contains the input image and the attention maps of the models trained with and without MER. Training without MER tends to make the model focus on broader regions, including much background noise, whereas MER regularizes the model to focus on more compact and discriminative regions. Note that rows 1–2 are visualized by summing over intermediate feature map channels, whereas rows 3–4 show class activation maps.

Results. Our simple ResNet50 trained with MER also gains a lot and even outperforms previous, more complicated models, as shown in Table 2. Hence our model is both accurate and fast.

We visualize the activation maps with CAM [40], using the last convolutional feature map and the weights of the last linear layer. As shown in the bottom two rows of Fig. 4, MER makes the model focus on more compact and discriminative regions and ignore the noisy background, which is harmful to generalization. In some circumstances, the appearance of the background or of common (non-discriminative) body regions can make a model more confident on the training set, but then the truly discriminative regions do not receive enough attention.
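A minimal sketch of the CAM [40] computation used for rows 3–4 of Fig. 4 (the function name is ours; upsampling to the input resolution is omitted):

```python
import torch

def class_activation_map(features: torch.Tensor, fc_weight: torch.Tensor,
                         cls: int) -> torch.Tensor:
    """Weight the last convolutional feature map by the linear-layer weights of class `cls`."""
    # features: (C, H, W) last conv feature map; fc_weight: (num_classes, C)
    cam = torch.einsum('c,chw->hw', fc_weight[cls], features)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # min-max normalize for display
    return cam
```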

Table 2. Accuracy on CUB-200-2011
Table 3. Theoretical and experimental convergence probability of positive class (CPP) with different \(\lambda \) on CTW character recognition.

4.3 Verification of \(p_y-\lambda \) Curve

We experimentally verify the theoretical convergence probability distribution described by Eq. 15 and Fig. 3.

Training with MER on CTW character recognition, we fix \(\lambda \) for each experiment. When the model converges, the value of the cross-entropy loss on the training set is used to calculate the expected experimental convergence probability of the positive class (CPP):

$$\begin{aligned} p_y=e^{-L_{\mathrm {CE}}} \end{aligned}$$
(16)

The theoretical convergence probability can be read directly from the curve in Fig. 3. As shown in Table 3, the experimental value is always slightly lower than the theoretical one. This is normal, since no model fits a complicated distribution perfectly. Moreover, the curve itself is only an approximation under the assumption that every negative class plays the same role. The two are therefore in accordance. As a result, the theoretical curve can be used to roughly estimate the convergence probability distribution on the training set, which gives \(\lambda \) a practical meaning and may guide us in choosing a proper value.
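For reference, the experimental CPP of Eq. 16 is a one-line computation; the loss value used here is illustrative, not a number from Table 3:

```python
import math

def experimental_cpp(mean_train_ce: float) -> float:
    """Eq. 16: recover the average positive-class probability from the converged training CE loss."""
    return math.exp(-mean_train_ce)

print(round(experimental_cpp(0.7), 2))   # a training CE loss of 0.7 corresponds to a CPP of about 0.50
```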

4.4 Comparison with Label Smoothing

Since MER is very similar to label smoothing (LS), we compare their results on CUB-200-2011. LS also has a coefficient \(\lambda \), which means the final convergence probability of the positive class (CPP) is \(\left( 1-\lambda +\frac{\lambda }{\mathrm {C}}\right) \). Hence for both methods the CPP decreases as \(\lambda \) increases.

Inspired by the theoretical relationship between CPP (\(p_y\)) and \(\lambda \), we choose 4 theoretical CPPs to form 4 pairs of experiments. For each CPP, the \(\lambda \) of LS is \(1-p_y\), and the \(\lambda \) of MER is chosen from the curve in Fig. 3 as in the previous part.
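The pairing of the two coefficients can be reproduced with the relations above. The sketch below uses \(\lambda _{\mathrm {LS}}=1-p_y\) as stated (which neglects the small \(\frac{\lambda }{\mathrm {C}}\) term) and Eq. 15 for MER; the listed target CPPs are illustrative, as only CPP = 24% is mentioned in the text:

```python
import math

def ls_lambda(target_cpp: float) -> float:
    """Label smoothing coefficient as used above: lambda = 1 - p_y."""
    return 1.0 - target_cpp

def mer_lambda(target_cpp: float, C: int) -> float:
    """MER coefficient from Eq. 15 with m = 1 / p_y."""
    m = 1.0 / target_cpp
    return m / math.log((C - 1) / (m - 1))

C = 200   # CUB-200-2011
for cpp in (0.90, 0.70, 0.50, 0.24):       # only 0.24 is cited in the text
    print(cpp, round(ls_lambda(cpp), 3), round(mer_lambda(cpp, C), 3))
```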

Table 4. Comparison of MER and label smoothing on CUB-200-2011
Table 5. Results of model trained with label corruption. Note that \(\lambda = 0.0\) means the model is trained without MER.

As shown in Table 4, MER always brings some improvement, while LS is more sensitive to \(\lambda \) and can even be harmful (see CPP = 24). MER achieves better accuracy than LS with their respective \(\lambda \)s. Besides, LS only works well when \(\lambda \) is small. We argue that the negative influence of the dependence on the uniform distribution is magnified and non-negligible with a large \(\lambda \) (a small CPP), which limits the potential of LS. By contrast, MER is more flexible in regularizing towards the expected probability distribution. For the same theoretical CPP, the training entropy of LS is higher than that of MER, which also reflects the influence of the uniform distribution.

4.5 Train with Label Corruption

To explore the power of MER on noisy datasets, we randomly corrupt the labels of a certain proportion of the training images. Label corruption is usually more harmful than feature corruption [3], since false labels mislead the learning process.

We find that the model trained with MER is more robust on both Chinese character recognition and fine-grained classification. In Table 5, the corruption rate is the proportion of training images whose labels are randomly corrupted. The model trained without MER (\(\lambda =0.0\)) suffers greatly because it is always very confident and devotes itself blindly to the given labels, even when some of them are wrong. MER regularizes the model to be less confident of the training labels, so the learning process is less disturbed.
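The corruption protocol can be sketched as follows (a hypothetical helper; whether a corrupted label may coincide with the original one is not specified in the text):

```python
import random

def corrupt_labels(labels, num_classes: int, corruption_rate: float, seed: int = 0):
    """Randomly re-assign the labels of a given proportion of training samples."""
    rng = random.Random(seed)
    corrupted = list(labels)
    n_corrupt = int(round(corruption_rate * len(labels)))
    for idx in rng.sample(range(len(labels)), n_corrupt):
        corrupted[idx] = rng.randrange(num_classes)   # may happen to equal the true label
    return corrupted
```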

5 Conclusion

In this paper, we present an in-depth analysis of Maximum Entropy Regularization, including how the MER term influences the convergence probability distribution and the learning process. MER improves the generalization and robustness of a model without any additional parameters. We employ MER on both Chinese text recognition and generic fine-grained classification to alleviate overfitting, and obtain consistent improvement. We hope that our theoretical analysis will be useful for further study of this regularization.