
1 Introduction

The image-to-markup problem has attracted the interest of researchers since the 1960s [2]. The main target of this research is the recognition of printed/handwritten mathematical expressions (MEs). Unlike typical sequence-to-sequence tasks such as machine translation [4] and speech recognition [8], image-to-markup generation converts two-dimensional (2D) images into sequences of a structural presentational language. More specifically, it has to scan the two-dimensional grid to generate markup for the symbols and the implicit spatial operators, such as subscripts and fractions. Image-to-markup generation also differs from other image-to-sequence tasks such as image captioning [7, 24] and text string recognition in optical character recognition [30], in that input images in the image-to-markup problem contain much more structural information and many more spatial relations than general images in computer vision.

Traditional approaches to the image-to-markup problem use handcrafted grammars to handle symbol segmentation, symbol recognition and structural analysis. Although grammar-driven approaches [1, 3, 23, 26] can achieve high performance in practice, they require a large amount of manual work to develop the grammatical rules. Furthermore, grammar-driven structural analysis is also highly computationally demanding.

Fig. 1. MEs written by different people (top) and their standard printed template (center). These images share the same ground-truth markup (bottom). The red cells indicate the attention on the same symbol a. The same symbol may be written in very different styles while sharing invariant features that represent the same semantic meaning. (Color figure online)

Recently, methods based on deep neural networks have been proposed for image-to-markup generation and have achieved great success [10, 11, 20, 28, 29]. For example, the model WYGIWYS extended the attention-based encoder-decoder architecture to the image-to-markup problem [11]. It encodes images of printed MEs with a multi-layer CNN and a bidirectional LSTM, and employs an LSTM as the recurrent attention-based decoder to generate the target LaTeX markup. To speed up the method, the authors of [10] improved the original WYGIWYS with coarse-to-fine attention and performed experiments on synthetic handwritten MEs. These studies show that data-driven attention-based models can be as effective as grammar-based approaches while exploiting no prior knowledge of the language.

For the image-to-markup problem, it is especially important to ensure that every local region of the input image is translated. Motivated by this observation, the model WAP [29] records the history of attention over all local regions to improve the coverage of translation. An improved version of WAP uses deep gated recurrent units (GRUs) to encode the online trajectory information of handwritten MEs [28] and has achieved state-of-the-art performance using an ensemble of five models.

Despite the progress achieved so far, handwritten-image-to-markup generation remains a very challenging task due to the highly variable handwriting styles compared with printed images; see Fig. 1. Moreover, well-annotated handwritten MEs are rather scarce. For example, the currently largest public database for handwritten mathematical expression recognition, from the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME), contains only 8,836 MEs. To alleviate the tension between the limited training data and the great variation in writing styles, it is common to augment the training set by distorting the input images [20].

To overcome the scarcity of annotated training data in handwritten-image-to-markup generation, we propose an attention-based model with paired adversarial learning for learning semantic-invariant features. The main contributions of this paper are as follows: (1) we present a scalable paired adversarial learning (PAL) method incorporating the prior knowledge of standard templates to guide the attention-based model to learn intrinsic semantic-invariant features; (2) we use a fully convolutional attention based decoder to speed up the image-to-markup decoding without losing accuracy; (3) we introduce a novel multi-directional transition layer that can be easily extended to other deep convolutional networks for accessing 2D contextual information.

2 Background

Before describing our proposed method, we briefly review the generative adversarial network (GAN) [14] in Sect. 2.1 and the convolutional attention (Conv-Attention) model [13] proposed for sequence-to-sequence learning in Sect. 2.2.

2.1 Generative Adversarial Network

GAN is a well-known adversarial learning method originally presented for generative learning by Goodfellow et al. [14]. It generally consists of a generator G and a discriminator D, which are trained with conflicting objectives:

$$\begin{aligned} \min \limits _{G}\max \limits _{D} V(G,D) = E_{\mathbf{x}\sim p_{data}(\mathbf{x})}[\log D(\mathbf{x})] + E_{\mathbf{z}\sim p_{\mathbf{z}}(\mathbf{z})}[\log (1 - D(G(\mathbf{z})))] \end{aligned}$$
(1)

where \(\mathbf{x}\) denotes a real target sample, \(\mathbf{z}\) is the input noise and \(D(\mathbf{x})\) is the probability that the sample is real. G tries to forge realistic samples to confuse D, while D tries to distinguish fake samples from real ones. This adversarial learning does force both G and D to improve and has proven effective in producing highly realistic samples [6, 12, 25].
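To make the alternating optimization behind Eq. (1) concrete, the following is a minimal PyTorch-style sketch of one GAN training step; the MLP generator and discriminator, the noise dimension and the learning rates are illustrative assumptions rather than components of our model.

```python
import torch
import torch.nn as nn

# Toy generator/discriminator for the minimax game in Eq. (1).
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 256))
D = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

def gan_step(x_real, eps=1e-8):
    z = torch.randn(x_real.size(0), 64)
    # D step: maximize log D(x) + log(1 - D(G(z))) over the discriminator.
    d_loss = -(torch.log(D(x_real) + eps).mean()
               + torch.log(1 - D(G(z).detach()) + eps).mean())
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()
    # G step: minimize log(1 - D(G(z))), i.e. try to make D accept the fakes.
    g_loss = torch.log(1 - D(G(z)) + eps).mean()
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()

losses = gan_step(torch.randn(16, 256))   # one update on a random "real" batch
```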

Recently, the idea of adversarial learning in GAN has been applied to image-to-image translation with very encouraging results [18, 31]. It is interesting to observe that D successfully guides G to learn the style information of the two domains and to realize style transfer from the source domain to the target domain. Another work related to our proposal is domain adaptation with GAN [5], in which G is guided by D to find a domain-invariant representation of two domains with different distributions. Inspired by these works, we design an attention-based model that grasps semantic-invariant features from symbols written in different styles under the adversarial learning framework.

2.2 Convolutional Attention

Although recurrent attention performs well on sequential problems, it is time consuming due to its sequential structure. In this section, we briefly introduce the fully convolutional attention (Conv-Attention) [13], which was proposed for machine translation and shows competitive performance while being more efficient.

Suppose the input sequence is \(\mathbf{w} = (w_{1},\cdots ,w_{N})\). \(\mathbf{w}\) is embedded in a distributional space as \(\mathbf{x} = (x_{1},\cdots ,x_{N})\), and the absolute positions of the input elements are embedded as \(\mathbf{p} = (p_{1},\cdots ,p_{N})\). The input of the encoder is finally represented as \(\mathbf{e} = (x_{1} + p_{1},\cdots ,x_{N}+p_{N})\) to give the model a sense of order, and the output sequence of the encoder is denoted \(\mathbf{f} = (f_{1},\cdots ,f_{N})\). The same process is also applied to the output elements already generated by the decoder.

The encoder and the Conv-Attention based decoder share a simple block structure. Each block (also referred to as a layer) contains a one-dimensional convolution and a subsequent non-linearity, and computes its output states from a fixed number of input elements. Each convolution kernel of the decoder blocks is parameterized as \(W \in \mathbb {R}^{2D \times kD}\) with a bias \(b_w \in \mathbb {R}^{2D}\), where k is the kernel width and D is the channel dimension of the input features. The kernel maps k concatenated input elements, each embedded in D dimensions, to a single output \(o_j \in \mathbb {R}^{2D}\). The non-linearity following the one-dimensional convolution is the gated linear unit (GLU) [9], which implements a gating mechanism over each output element \(o_j\):

$$\begin{aligned} GLU(o_j) = o_{j1} \odot \sigma (o_{j2}) \end{aligned}$$
(2)

where \(\odot \) denotes point-wise multiplication and the gates \(\sigma (o_{j2})\) determine which parts of \(o_{j1}\) are relevant. The output has half the channel dimension of the input \(o_{j}\).
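As a quick illustration, Eq. (2) amounts to splitting the 2D-channel convolution output in half along the channel axis and gating one half with the other; the tensor shapes below are arbitrary examples (PyTorch also provides this operation as torch.nn.functional.glu).

```python
import torch

def glu(o):
    # Eq. (2): o = [o1; o2] along the channel axis, output o1 * sigmoid(o2).
    o1, o2 = o.chunk(2, dim=-1)
    return o1 * torch.sigmoid(o2)

o = torch.randn(8, 10, 512)   # (batch, positions, 2D) with D = 256
print(glu(o).shape)           # torch.Size([8, 10, 256])
```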

Conv-Attention uses a separate attention mechanism for each decoder block. It first computes the state summary \(s_i^l\) from the previous target embedding \(t_i\) and the current hidden state \(h_i^l\) of the l-th block as:

$$\begin{aligned} s_i^l = W^l_sh_i^l + b^l_s + t_i \end{aligned}$$
(3)

Next, the attention score \(\alpha _{i,j}^l\) of state i and source element j is calculated via a dot-product between the element \(f_j\) of the feature sequence \(\mathbf{f}\) and the state summary \(s_i^l\):

$$\begin{aligned} \alpha _{i,j}^l = \frac{ \exp ( s_i^l \cdot f_j )}{\sum _{w=1}^{N}\exp ( s_i^l \cdot f_w )} \end{aligned}$$
(4)

After the weights \(\alpha _{i,j}^l\) have been computed, the context vector is calculated as:

$$\begin{aligned} c_{i}^l = \sum _{j=1}^{N} \alpha _{i,j}^l(f_j + e_j) \end{aligned}$$
(5)

Then, the context vector \(c_i^l\) is simply added to the corresponding hidden state \(h_i^{l}\). This operation can be considered attention with multiple hops [27], which improves the model's ability to access the attention history. Furthermore, to improve the information flow between blocks, residual connections are added from input to output, as in ResNet [16].
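The per-block attention of Eqs. (3)–(5) can be sketched in a few lines; the batched tensor interface and parameter names below are our own simplification under assumed shapes, not the reference implementation of [13].

```python
import torch

def conv_attention_hop(h_l, t, f, e, W_s, b_s):
    # h_l: (B, I, D) hidden states of decoder block l
    # t:   (B, I, D) embeddings of previously generated targets
    # f:   (B, N, D) encoder outputs, e: (B, N, D) encoder input embeddings
    s = h_l @ W_s + b_s + t                                   # Eq. (3)
    alpha = torch.softmax(s @ f.transpose(1, 2), dim=-1)      # Eq. (4)
    c = alpha @ (f + e)                                       # Eq. (5)
    return h_l + c                                            # context added back

B, I, N, D = 2, 7, 30, 256
out = conv_attention_hop(torch.randn(B, I, D), torch.randn(B, I, D),
                         torch.randn(B, N, D), torch.randn(B, N, D),
                         torch.randn(D, D), torch.randn(D))
```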

3 Paired Adversarial Learning

The motivation of our work is to make the model learn semantic-invariant features of patterns in order to overcome the difficulties caused by writing-style variation and the small sample size. Roughly speaking, for each handwritten image in the training set, we first generate its printed image template by compiling the LaTeX format label with a general LaTeX editor. Then, we force the attention-based encoder-decoder model to extract similar features from both the handwritten image and its printed template, which is implemented under the adversarial learning framework.

Fig. 2. Architecture of the paired adversarial learning (PAL) model. During training, each handwritten image is input together with its paired printed template (bottom left). The encoder-decoder model and the discriminator are trained alternately. We use a Conv-Attention based decoder here to speed up image-to-markup decoding; in principle, the attention module could also be replaced by any standard recurrent attention.

The proposed model consists of three parts (see Fig. 2): an encoder that extracts features from images, a decoder that parses the sequence of features outputted by the encoder and generates the markup, and a discriminator trained against the encoder-decoder model to force it to learn the semantic-invariant feature of each pattern. In the following subsections, we will sequentially introduce the encoder, the decoder, the learning objective of the proposed paired adversarial learning method, and the training algorithm.

3.1 Multi-directional Encoder

The basic architecture of the encoder is adapted from the fully convolutional network (FCN) model in [29]. The difference is that we introduce a novel layer, named the MD-transition layer, placed after each convolutional block of the deep FCN. We utilize the multi-dimensional long short-term memory (MDLSTM) [15] to improve the FCN's access to 2D contextual information and apply a pooling layer [21] before the MDLSTM layer to improve computational efficiency. We refer to this combination as the MD-transition layer.

MDLSTM employs LSTM layers in the up, down, left and right directions. The different LSTM layers are executed in parallel to improve computational efficiency. After the LSTM layers, we collapse the feature maps of the different directions by simply summing them. The LSTM layers in the horizontal and vertical directions are calculated as:

$$\begin{aligned} (y_{i,j},c_{i,j}) = LSTM(x_{i,j},y_{i\pm 1,j},y_{i,j \pm 1},c_{i\pm 1,j},c_{i,j \pm 1}) \end{aligned}$$
(6)

where y and c denote the output feature vector and the inner cell state, respectively, and \(x_{i,j}\) denotes the input feature vector at position (i, j) of the feature map. LSTM denotes the mapping function of a general LSTM network that processes its input sequence over space or time. With this setup, the subsequent FCN block can access more past and future contextual information in both the horizontal and vertical directions.
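A rough sketch of the MD-transition layer is given below; it approximates the four-direction MDLSTM of Eq. (6) with separable bidirectional row and column scans after 2 × 2 max-pooling, summing the directional outputs, so it should be read as a simplified stand-in rather than the exact layer of our encoder.

```python
import torch
import torch.nn as nn

class MDTransition(nn.Module):
    """2x2 max-pooling followed by horizontal and vertical LSTM scans whose
    four directional outputs are summed (a separable approximation of Eq. (6))."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.MaxPool2d(2, 2)
        self.h_lstm = nn.LSTM(channels, channels, bidirectional=True, batch_first=True)
        self.v_lstm = nn.LSTM(channels, channels, bidirectional=True, batch_first=True)

    def forward(self, x):                                    # x: (B, C, H, W)
        x = self.pool(x)
        B, C, H, W = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(B * H, W, C)    # left/right scans
        h, _ = self.h_lstm(rows)
        h = h.view(B, H, W, 2, C).sum(3)
        cols = x.permute(0, 3, 2, 1).reshape(B * W, H, C)    # up/down scans
        v, _ = self.v_lstm(cols)
        v = v.view(B, W, H, 2, C).sum(3).permute(0, 2, 1, 3)
        return (h + v).permute(0, 3, 1, 2)                   # back to (B, C, H, W)

y = MDTransition(32)(torch.randn(2, 32, 16, 16))             # -> (2, 32, 8, 8)
```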

3.2 Decoder with Convolutional Attention

We adapt the Conv-Attention decoder [13] to generate markup for images and to speed up decoding. Different from machine translation, image-to-markup generation is a 2D-to-sequence problem. Since the outputs of the encoder are 2D feature maps rather than feature sequences, we need a conversion that preserves as much feature information as possible.

Suppose the output feature map of the multi-directional feature extractor has size \(H \times W \times D\). We split the feature map by columns and concatenate them to obtain the feature sequence \(\mathbf{f} = (f_1,\cdots ,f_{N})\) with \(N = H \times W\). Then, to preserve position information during the conversion, we add to \(\mathbf{f}\) the embedding of the absolute positions and obtain the embedded feature sequence \(\mathbf{e} = (f_1 + p_1,\cdots ,f_N + p_N)\). Here, unlike the original work, \(\mathbf{e}\) is not the input embedding of the encoder. After obtaining \(\mathbf{f}\) and \(\mathbf{e}\), we compute the context vectors as in Sect. 2.2. With this rearrangement of the feature map and the position embedding, Conv-Attention can be applied to image-to-markup generation. In this study, the number of Conv-Attention blocks l is set to 3. Via the multi-step attention, the model can access more attention history, thereby improving its ability to track consistently with its attention.
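A minimal sketch of this 2D-to-sequence conversion is shown below; the column-major flattening order and the learned position-embedding table are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def map_to_sequence(feature_map, pos_embedding):
    # Split the (B, D, H, W) encoder output by columns, concatenate the
    # columns into a length H*W sequence f, and add a position embedding
    # to obtain the embedded feature sequence e.
    B, D, H, W = feature_map.shape
    f = feature_map.permute(0, 3, 2, 1).reshape(B, W * H, D)
    positions = torch.arange(W * H, device=feature_map.device)
    e = f + pos_embedding(positions)
    return f, e

pos = nn.Embedding(1024, 256)                                # assumed max length
f, e = map_to_sequence(torch.randn(2, 256, 8, 16), pos)      # f, e: (2, 128, 256)
```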

3.3 Paired Adversarial Learning

Adversarial learning for the image-to-markup task is more complex than for image generation, since a mismatch between two sequences of feature vectors can easily cause the discriminator to converge on irrelevant features and thus lose the ability to guide the encoder-decoder model towards semantic-invariant features. To settle this problem, first, we pair each handwritten image with a printed template of the same size to ensure that the lengths of the two feature sequences are the same. Second, since the labels of the paired images are the same, the feature vectors at the same positions of the two feature sequences are forced, by the attention mechanism, to be extracted from related regions.

Specifically, let \(\mathbf{a}(x,\phi _{E}) = (a_1,\cdots ,a_T)\) denote the feature sequence at the last feature extraction layer of the decoder. Here, x is the input handwritten image \(x_h\) or its paired printed template \(x_p\), and \(\phi _{E}\) denotes the parameters of the encoder-decoder model. Our model learns semantic-invariant features under the guidance of a discriminator D that judges whether a feature vector comes from a handwritten image or a printed template. Let \(D(a_i(x,\phi _{E}),\phi _{D})\) denote the probability that feature vector \(a_i\) comes from a printed image, where \(\phi _{D}\) denotes the parameters of D. The objective function is defined as:

$$\begin{aligned} \mathcal {L}_D = {E}_{(x_h,x_p)\sim X} \big [\, {E}_{a_i(x_p,\phi _{E}) \sim \mathbf{a}}[\log D(a_i(x_p,\phi _{E}),\phi _{D})] + {E}_{a_i(x_h,\phi _{E}) \sim \mathbf{a}}[\log (1 - D(a_i(x_h,\phi _{E}),\phi _{D}))] \,\big ] \end{aligned}$$
(7)

where \(X=\{(x_h,x_p)\}\) is the set of paired training images. D is optimized to assign correct labels to the extracted features by maximizing \(\mathcal {L}_D\). On the contrary, the encoder-decoder model is trained to learn semantic-invariant features that confuse D by minimizing \(\mathcal {L}_D\).
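For one paired minibatch, Eq. (7) can be sketched as follows, assuming D maps each feature vector to a probability in (0, 1); the interface is hypothetical.

```python
import torch

def discriminator_objective(D, a_printed, a_hand, eps=1e-8):
    # a_printed, a_hand: (B, T, D) feature sequences of a paired minibatch.
    # D maximizes this value; the encoder-decoder minimizes it (Eq. (7)).
    return (torch.log(D(a_printed) + eps).mean()
            + torch.log(1 - D(a_hand) + eps).mean())
```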

Moreover, the primary goal of the encoder-decoder model is to extract discriminative features and generate the correct markup. Thus, the decoder converts the feature sequence into markup through a classification layer as:

$$\begin{aligned} p(y_{a_i} = y_i \mid x;\phi _{E}) = \frac{\exp (C(y_{a_i} = y_i \mid \mathbf{a}(x,\phi _E)))}{\sum _{l=1}^{L} \exp (C(y_{a_i} = l \mid \mathbf{a}(x,\phi _E)))} \end{aligned}$$
(8)

Here, \(y_i \in Y = \{1,\ldots ,L\}\), where L denotes the total number of classes in the label set, and \(y_{a_i}\) is the prediction for feature vector \(a_i\) in the feature sequence \(\mathbf{a}(x,\phi _{E})\).

Ideally, features extracted from both the printed and handwritten images should be classified correctly with high probabilities. The cross-entropy objective function for classifying the features learned from printed images is defined as:

$$\begin{aligned} \mathcal {L}_{C_p} = -{E}_{{x_p}\sim X_p} [\sum \nolimits _{i=1}^{T}\log p(y_{a_i} = y_i| x_p;\phi _{E}) ] \end{aligned}$$
(9)

where \(X_p=\{x_p\}\) is the set of printed image templates. Similarly, the loss function for classifying the features learned from handwritten images is defined as:

$$\begin{aligned} \mathcal {L}_{C_h} = -{E}_{{x_h}\sim X_h} [\sum \nolimits _{i=1}^{T}\log p(y_{a_i} = y_i| x_h;\phi _{E}) ] \end{aligned}$$
(10)

where \(X_h=\{x_h\}\) is the set of handwritten images.

In summary, we train the attention-based encoder-decoder model by minimizing the loss function:

$$\begin{aligned} \mathcal {L}_E = \mathcal {L}_{C_p} + \mathcal {L}_{C_h} + \lambda \mathcal {L}_D \end{aligned}$$
(11)

\(\lambda \) is a hyper-parameter that controls the tradeoff between discriminative features and semantic-invariant features. When \(\lambda = 0\), the method reduces to a general attention-based encoder-decoder model trained on the paired samples. As \(\lambda \) increases, the method focuses more on learning semantic-invariant features and extracts less discriminative features for the classification layer to generate predictions.
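Putting Eqs. (9)–(11) together, a sketch of the encoder-decoder loss for one paired minibatch might look like the following; the tensor layout and the value lam = 0.15 (the setting later used in Sect. 4.5) are illustrative.

```python
import torch.nn.functional as F

def encoder_decoder_loss(logits_p, logits_h, targets, adv_term, lam=0.15):
    # logits_p, logits_h: (B, T, L) classification outputs for the printed and
    # handwritten branches; targets: (B, T) ground-truth symbol indices;
    # adv_term: the value of Eq. (7) for this minibatch.
    L_cp = F.cross_entropy(logits_p.flatten(0, 1), targets.flatten())   # Eq. (9)
    L_ch = F.cross_entropy(logits_h.flatten(0, 1), targets.flatten())   # Eq. (10)
    return L_cp + L_ch + lam * adv_term                                 # Eq. (11)
```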

3.4 Training Procedure

The encoder-decoder model and the discriminator D are trained jointly with the paired adversarial learning algorithm. D is optimized to distinguish the sequences of feature vectors extracted from handwritten images from those extracted from printed templates. Conversely, the encoder-decoder model is optimized to extract more sophisticated semantic-invariant features to fool D. Meanwhile, the encoder-decoder model is trained to maximize the probability of the ground-truth markup symbols of the input images. The importance of these two objectives is balanced via the hyper-parameter \(\lambda \).

We sample a minibatch of paired samples to train the encoder-decoder model and D in every training cycle. The encoder-decoder model is trained once first, and then D is trained k times. The parameters of both models are updated by adaptive moment estimation (Adam). Specifically, we update the parameters of the encoder-decoder model as:

$$\begin{aligned} \phi _{E} \leftarrow \phi _{E} - Adam(\frac{\partial (\mathcal {L}_{C_p} + \mathcal {L}_{C_h} + \lambda \mathcal {L}_D)}{\partial \phi _{E}},\eta _E) \end{aligned}$$
(12)

and those of the discriminator by:

$$\begin{aligned} \phi _{D} \leftarrow \phi _{D} + Adam(\frac{\partial \mathcal {L}_D}{\partial \phi _{D}},\eta _D) \end{aligned}$$
(13)

Here, Adam denotes the function that computes the update value of adaptive moment estimation from the gradient and the learning rate, \(\eta _E\) denotes the learning rate of the encoder-decoder model and \(\eta _D\) that of the discriminator. See more details in Algorithm 1.

Algorithm 1. The paired adversarial learning training procedure.
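For readers who prefer code to pseudocode, the sketch below mirrors the alternating updates of Eqs. (12) and (13); the data loader and the encoder_decoder/D interfaces are hypothetical placeholders rather than our actual implementation.

```python
import torch
import torch.nn.functional as F

def train_pal(encoder_decoder, D, loader, opt_E, opt_D, lam=0.15, k=1, eps=1e-8):
    for x_hand, x_print, targets in loader:
        # 1) update the encoder-decoder once (Eq. (12))
        logits_h, a_h = encoder_decoder(x_hand)
        logits_p, a_p = encoder_decoder(x_print)
        adv = (torch.log(D(a_p) + eps).mean()
               + torch.log(1 - D(a_h) + eps).mean())
        loss_E = (F.cross_entropy(logits_p.flatten(0, 1), targets.flatten())
                  + F.cross_entropy(logits_h.flatten(0, 1), targets.flatten())
                  + lam * adv)
        opt_E.zero_grad()
        loss_E.backward()
        opt_E.step()
        # 2) update the discriminator k times (Eq. (13))
        for _ in range(k):
            with torch.no_grad():
                _, a_h = encoder_decoder(x_hand)
                _, a_p = encoder_decoder(x_print)
            loss_D = -(torch.log(D(a_p) + eps).mean()
                       + torch.log(1 - D(a_h) + eps).mean())
            opt_D.zero_grad()
            loss_D.backward()
            opt_D.step()
```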

4 Experiments

4.1 Datasets

We validate our proposal on handwritten-image-to-LaTeX generation with the large public dataset from the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME) [22]. The CROHME 2013 dataset consists of 8,836 training samples and 671 test samples. The training set of CROHME 2014 is the same as that of CROHME 2013, but the 986 handwritten samples of the test set were newly collected and labeled. We use the CROHME 2013 test set as the validation set to evaluate our model during training and test the final model on the CROHME 2014 test set. The number of symbol classes for both CROHME 2013 and CROHME 2014 is 101. Each mathematical expression in the dataset is stored in InkML format, which contains the trajectory coordinates of the handwritten strokes and the ground-truth markup in LaTeX and MathML formats. Models for handwritten-image-to-LaTeX generation are evaluated at the expression level by the expression recognition rate (ExpRate), the metric used to rank the participating systems in all CROHME competitions. A generated markup is counted as correct only if the markup for all symbols and spatial operators is generated in the right order. This expression-level metric evaluates whether all symbols and their structures are translated correctly.

In this study, we do not use the online trajectory information of the strokes; we simply connect adjacent coordinate points within the same stroke to obtain offline images of the handwritten MEs. Each printed image template of the training data is obtained by compiling the LaTeX format label with a general LaTeX editor. All images are then normalized to a height of 128 pixels, and the images in each minibatch are padded with background pixels to the width of the largest one. This preprocessing ensures that the features extracted from different images have the same size.
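This preprocessing can be sketched as follows with PIL and NumPy; the resampling filter and the background value of 255 are assumptions.

```python
import numpy as np
from PIL import Image

def normalize_height(img, target_h=128):
    # Resize to a height of 128 pixels while keeping the aspect ratio.
    w = max(1, round(img.width * target_h / img.height))
    return img.resize((w, target_h), Image.BILINEAR)

def pad_minibatch(images, background=255):
    # Pad every grayscale image to the width of the widest one in the batch.
    arrays = [np.array(im.convert("L")) for im in images]
    max_w = max(a.shape[1] for a in arrays)
    return np.stack([np.pad(a, ((0, 0), (0, max_w - a.shape[1])),
                            constant_values=background) for a in arrays])
```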

Table 1. Configurations of the PAL model

4.2 Model Configurations

In this section, we briefly summarize the configurations of our proposed PAL model; see Table 1 for details. The encoder is adapted from the deep FCN of WAP [29], but equipped with an MD-transition layer after each CNN block. Each CNN block of the encoder contains four convolutional layers, and each convolutional layer is followed by a batch normalization layer [17] and a rectified linear unit (ReLU) [19]. The filter size of the convolutional layers is \(3\,\times \,3\) and the convolution stride is \(1\,\times \,1\). When a feature map is input to a hidden convolutional layer, it is zero-padded by one pixel to keep its size fixed. In addition to the size, the number of channels of the feature maps is also fixed within each CNN block. Every pooling layer in the MD-transition layers is max-pooling with a 2 \(\times \) 2 kernel and 2 \(\times \) 2 stride to reduce the size of the feature map.
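As a compact illustration of this configuration, one encoder CNN block could be assembled as below; the channel count is a placeholder and the transition between blocks (handled by the MD-transition layer) is omitted.

```python
import torch.nn as nn

def cnn_block(channels, n_layers=4):
    # Four 3x3 convolutions with stride 1 and 1-pixel zero padding, each
    # followed by batch normalization and ReLU; channels stay fixed inside
    # the block, matching the configuration described above.
    layers = []
    for _ in range(n_layers):
        layers += [nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
                   nn.BatchNorm2d(channels),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)
```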

The decoder consists of 3 Conv-Attention blocks and a subsequent multi-layer perceptron (MLP). Each block of the Conv-Attention decoder contains a one-dimensional convolutional layer with kernel width 3 and stride 1, equipped with the GLU non-linear activation function introduced in Sect. 2.2. The discriminator D is an MLP with two fully connected layers. We employ dropout to prevent over-fitting. L in the table denotes the total number of symbol classes in the markup ground truth. All models are implemented in Torch and trained on 4 Nvidia TITAN X GPUs.

Table 2. ExpRate (%) of different systems on CROHME 2014 test set

4.3 Validation on CROHME

We compare our proposal with the systems submitted to CROHME 2014 and with some recently presented attention-based models for handwritten-image-to-LaTeX generation. The results of these systems are listed in Table 2. Systems I to VII are the participants of CROHME 2014, and the next three systems, from WYGIWYS\(^*\) to WAP\(^*\), are recently presented attention-based models. For a fair comparison, system III is excluded from Table 2 because it used unofficial extra training data, and the attention-based models listed in Table 2 are all trained with offline images. The columns ExpRate \(\le 1 (\%), \le 2 (\%), \le 3 (\%)\) denote the accuracy of markup generation with at most one, two or three symbol-level errors, indicating the room for further improvement of the models.

Our proposed PAL model outperforms system I, named seehat, which won first place at CROHME 2014, by more than 2% ExpRate. More importantly, it is interesting to observe that the one-to-three symbol-level errors of our proposal are significantly reduced, owing to its grasp of semantic-invariant features for each symbol. The sign \(*\) in Table 2 denotes the use of an ensemble of 5 differently initialized models to improve performance [29]. WYGIWYS is the first attention-based model proposed for image-to-markup generation [11]; with the ensemble method it achieves an ExpRate of 28.70%. End-to-end denotes the encoder-decoder model in [20], which has an architecture similar to WYGIWYS. It achieves an ExpRate of 18.97%, which rises to 35.19% after distorting the training images to enlarge the training set six-fold. WAP\(^*\) denotes the state-of-the-art WAP model [29] trained with the ensemble method and, like the other attention-based models here, without the online trajectory information. We use the same ensemble method to obtain PAL\(^*\), and the results show that our proposed PAL model outperforms WAP under the same conditions. While the ensemble method can effectively improve performance, it requires much more memory to run as fast as a single model through parallel computing.

Fig. 3. Examples of handwritten images and the markup generated by our proposal

To make the results more intuitive, Fig. 3 shows some handwritten MEs from the CROHME 2014 test set together with the markup generated by our proposal. The red symbols in the generated markup (gen) indicate incorrect generations, and the blue symbols in the ground truth (truth) indicate the corresponding correct markup that our proposal failed to generate. The results show that our proposal is effective in dealing with complex 2D structures and symbols written in various styles. It is worth noting that some symbols are so similar, or written so scribbly, that even humans could be confused.

4.4 Comparison of Different \(\lambda \)

In this section, we further analyze how the hyper-parameter \(\lambda \) in Eq. (11) affects the performance of our proposed PAL model. By balancing the loss for markup generation against the loss for feature discrimination, \(\lambda \) controls the tradeoff between the discriminative and the semantic-invariant features that the encoder-decoder model learns. When \(\lambda \) is small, the discriminative objective comprises the majority of the encoder-decoder loss and dominates the gradients propagated back to the model. As \(\lambda \) increases, the encoder-decoder model masters more semantic-invariant features of the same symbols in the printed templates and the handwritten images. However, when \(\lambda \) grows too large, the model focuses too much on semantic-invariant features and may even try to generate identical feature sequences for printed and handwritten images to confuse the discriminator. This leads to less discriminative features for the different categories and decreases the markup generation accuracy. In the extreme case, the model may attend only to background regions and even generate features that are constant at each step to fool the discriminator \(D \). Therefore, an appropriate \(\lambda \) plays an important role in the PAL model. We explore different values of \(\lambda \) while keeping the other configurations of the model fixed and then evaluate these models on the CROHME dataset. The results are shown in Fig. 4.

Fig. 4. Comparison of different \(\lambda \) on the CROHME dataset

4.5 Analysis of Print Templates

It is worth noting that the printed templates used for training are also crucial to the PAL model. Firstly, just as we cannot use printed English books to teach humans to recognize handwritten Chinese characters, the templates need to carry semantic information related to the target images. Only then can the attention-based encoder-decoder model learn semantic-invariant features for each specific symbol in the paired images. Secondly, the distribution of the standard templates needs to be easier to learn. In this way, simple templates can guide the model in dealing with complex samples through paired adversarial learning.

Table 3. Analysis of the influence of printed templates

We first validate the ability of the Conv-Attention based encoder-decoder model, without paired adversarial learning, to generate markup for printed images. Specifically, we compile the LaTeX format ground truth of the CROHME 2013 test set to obtain a printed CROHME 2013 test set (CROHME 2013 P) and train the model only on the standard printed templates; see Conv-Attention P in Table 3. We then train the same model on only the handwritten images (Conv-Attention H). Surprisingly, the accuracy of printed-image-to-markup generation is more than double that of handwritten-image-to-markup generation. However, the model Conv-Attention P appears to have over-fitted to the printed images when tested on the handwritten CROHME 2013 test set.

After that, we mix the two kinds of samples to conduct experiments (Conv-Attention P&H). The experimental results show that the distributions of the two kinds of samples are relatively close, and that adding the printed templates helps markup generation for the handwritten images. The model gains about \(1\%\) ExpRate compared with Conv-Attention H when validated on the CROHME 2013 test set, and its generalization is significantly enhanced when tested on the CROHME 2014 test set. When we train the model with paired adversarial learning (PAL) and set \(\lambda \) to 0.15, this increase becomes even more apparent, on both validation and test. We also tested the PAL model on CROHME 2013 P; the result shows that the model does lose some knowledge about the distribution of the printed images compared with Conv-Attention P.

In addition, we applied some global distortions to the printed templates to further explore the influence of their distribution. We rotate each standard printed sample by 4 angles randomly chosen from \(-2^\circ \) to \(2^\circ \) at an interval of \(0.5^\circ \), excluding \(0^\circ \) (the minus sign represents counterclockwise rotation). We then add these rotated images to the undistorted standard templates and re-pair each printed template with the handwritten image that has the same label. The resulting \(8,836*5\) image pairs are used to train the model called PAL GD. Interestingly, we find that the validation accuracy is further improved, but the test accuracy slightly decreases. We believe that more elaborate distortions would also improve the test accuracy, but this contradicts our original intention of training the attention-based model with easy-to-obtain templates through paired adversarial learning. Therefore, we have not conducted further experiments on distortion.
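The global distortion described above can be sketched as follows with PIL; note that PIL treats positive rotation angles as counterclockwise, so the sign convention in this snippet is illustrative only.

```python
import random
from PIL import Image

ANGLES = [a * 0.5 for a in range(-4, 5) if a != 0]   # -2.0 ... 2.0 deg, step 0.5, no 0

def distort_template(printed_img, n_rotations=4, background=255):
    # Rotate one standard printed template by 4 distinct angles; the rotated
    # copies are then re-paired with handwritten images of the same label.
    angles = random.sample(ANGLES, n_rotations)
    return [printed_img.rotate(a, expand=True, fillcolor=background) for a in angles]

rotated = distort_template(Image.new("L", (400, 128), 255))   # toy blank template
```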

5 Conclusion

In this paper, we introduce a novel paired adversarial learning method to guide an attention-based model to learn semantic-invariant features, as humans do when focusing attention on specific objects. Our proposal incorporates the prior knowledge of simple templates and improves the performance of an attention-based model on more complex tasks. The proposal performs much better than other systems under the same training conditions on CROHME 2014. We also extend fully convolutional attention from machine translation to speed up the decoding of image-to-markup generation.

In future work, we plan to explore language-model-based attention to make the neural network behave more like a human when generating the markup for input images. We will also apply paired adversarial learning in more fields, such as text string recognition in optical character recognition, to improve model performance.