1 Introduction

The Chinese domestic value-added tax (VAT) invoice is an important accounting and billing document that also serves as a corporate tax certificate, and it is widely used in transactions among enterprises. Its format is strictly controlled by the State Administration of Taxation. The financial shared service centres of large and medium-sized enterprises must handle a large number of VAT invoices every day, yet these invoices are often processed manually and inefficiently. Such enterprises need automated, unsupervised processing systems for VAT invoices to reduce costs and to strengthen their financial management capability [1]. Several projects of this kind have already been built or are being planned. Ongoing enterprise ERP programmes provide a good infrastructure for such systems, and image processing technologies such as text detection and text recognition are reaching commercial feasibility; with some additional effort, automatic recognition and processing of VAT invoice images can become a reality.

Due to the large variability of text patterns and the highly complicated backgrounds, recognizing and processing photographed VAT invoice images is much more challenging than handling scanned ones. An overview of the network architecture is presented in Fig. 1. It consists of a number of convolutional layers; a detection module that predicts corner points of text bounding boxes, text segmentation maps, and layout information used to regress the text box locations; an encoder that embeds proposals of varying sizes into fixed-length vectors; and an attention-based Long Short-Term Memory (LSTM) decoder for word recognition. With this framework, an automatic VAT invoice recognition and processing system is built and implemented.

Fig. 1.

Model overview. The network takes an image as input, and outputs both text bounding boxes and text labels.

We validate the effectiveness of our method on our accumulated VAT invoice image datasets in the enterprise financial management scenario. The results show the advantages of the proposed algorithm in accuracy and applicability.

The contributions of this paper are three-fold: (1) We propose a unified framework for processing and recognizing VAT invoices, which can be trained and evaluated end-to-end. (2) Our method simultaneously handles the challenges of multi-oriented text in VAT invoice images, such as rotation, varying aspect ratios, and very close instances. (3) We take invoice layout information into consideration and use layout-based rules to regress and constrain the text bounding boxes.

2 Related Work

An automatic VAT invoice recognition and processing system essentially comprises two tasks: text detection and word recognition. In this section, we briefly review related work on text detection, word recognition, and text spotting systems that combine both. Text detection has developed rapidly in recent years and can be roughly divided into two categories: horizontal text detection and skewed text detection. For horizontal text detection, a number of approaches detect words directly in images using DNN-based techniques, in a manner similar to generic object detection. Tian et al. [2] developed a vertical anchor mechanism and proposed the Connectionist Text Proposal Network (CTPN) at ECCV 2016 to accurately localize text lines in images. Recent approaches to skewed text detection include SegLink [3] and the corner localization and region segmentation method proposed by Lyu et al. [4]. SegLink [3] predicts text segments and the links between them in an SSD-style network and connects the segments into text boxes, in order to handle long oriented text in natural scenes. Lyu et al. [4] detect scene text by localizing the corner points of text bounding boxes and segmenting text regions at relative positions. Word recognition has made less progress in the last two years, and there are two main methods. One is proposed by Shi et al. [5]: a novel neural network architecture that integrates feature extraction, sequence modeling and transcription into a unified framework. The other is presented by Lee et al. [6], which uses recursive recurrent neural networks with attention modeling for lexicon-free optical character recognition in natural scene images. Text spotting must handle both text detection and word recognition. Li et al. [7] proposed a unified network that simultaneously localizes and recognizes text in a single forward pass, avoiding intermediate steps such as image cropping, feature re-computation, word separation, or character grouping. For a specific application scenario, Xie et al. [1] used traditional image processing techniques to develop an automatic invoice recognition and processing system.

3 Approach

3.1 Overall Architecture

The whole system architecture is illustrated in Fig. 1. It includes two parts: a text detection network (TDN) and a text recognition network (TRN). The text detection network localizes text in images and generates bounding boxes for words. The text recognition network recognizes the words inside the bounding boxes produced by the detection network. Our model is motivated by recent progress in FPN [8], DSSD [9], Instance FCN models [10] and sequence-to-sequence learning [11, 12], and we also take the special characteristics of text and invoice layout information into consideration. In this section, we present a detailed description of the whole system.
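As a rough illustration of this two-stage structure, the following Python sketch chains the detection and recognition stages; the class interface (detect, recognize) and variable names are placeholders for illustration, not the actual implementation.

```python
# Hypothetical high-level pipeline sketch; the TDN/TRN interfaces below are
# placeholders and do not correspond to the released implementation.

def process_invoice(image, tdn, trn):
    """Two-stage pipeline: detect text boxes, then recognize each region."""
    # TDN: localize text and return one bounding box plus pooled features per word.
    boxes, region_features = tdn.detect(image)
    # TRN: decode each detected region into a character sequence.
    words = [trn.recognize(feat) for feat in region_features]
    return list(zip(boxes, words))
```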

3.2 Text Detection Network

The detection network of our method is a fully convolutional network (FCN) that performs feature extraction, corner detection, position-sensitive segmentation and fully convolutional segmentation. Inspired by the good performance achieved by FPN [8] and DSSD [9], we adopt a backbone in the FPN/DSSD style to extract features. In detail, we convert fc6 and fc7 of VGG16 into convolutional layers and name them conv6 and conv7 respectively. Several extra convolutional layers (conv8, conv9, conv10, conv11) are then stacked above conv7 to enlarge the receptive fields of the extracted features. After that, a few deconvolution modules proposed in DSSD [9] are used in a top-down pathway (Fig. 2). In particular, to detect text of different sizes well, we cascade deconvolution modules with 256 channels from conv11 down to conv3 (the features from conv10, conv9, conv8, conv7, conv4 and conv3 are reused), so that 6 deconvolution modules are built in total. Including the features of conv11, we denote the output features F3, F4, F7, F8, F9, F10 and F11 for convenience. Finally, the features produced by conv11 and the deconvolution modules, which have richer representations, are used to detect corner points and predict position-sensitive maps. A large number of candidate bounding boxes can be generated by sampling and grouping corner points. Inspired by [4], we score the candidate boxes by rotated position-sensitive average RoI pooling and detect arbitrarily oriented text using the position-sensitive segmentation maps.
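A minimal TensorFlow 1.x sketch of this backbone pattern is given below (extra convolutional stages above conv7 plus a DSSD-style deconvolution module); the kernel sizes, strides and the way top-down and lateral features are merged are illustrative assumptions, not the exact configuration used in the paper.

```python
import tensorflow as tf  # TensorFlow 1.x style, matching the r1.4.1 implementation

def extra_conv_layers(conv7):
    """Stack conv8-conv11 above conv7 to enlarge the receptive field.
    Channel widths and strides here are illustrative assumptions."""
    x, feats = conv7, {}
    for name in ('conv8', 'conv9', 'conv10', 'conv11'):
        x = tf.layers.conv2d(x, 256, 1, padding='same',
                             activation=tf.nn.relu, name=name + '_1')
        x = tf.layers.conv2d(x, 256, 3, strides=2, padding='same',
                             activation=tf.nn.relu, name=name + '_2')
        feats[name] = x
    return feats

def deconv_module(top_down, lateral, name):
    """DSSD-style deconvolution module with 256 output channels: upsample the
    coarser top-down feature and merge it with the finer lateral feature."""
    with tf.variable_scope(name):
        up = tf.layers.conv2d_transpose(top_down, 256, 4, strides=2, padding='same')
        lat = tf.layers.conv2d(lateral, 256, 3, padding='same')
        return tf.nn.relu(up + lat)
```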

Fig. 2.

Network architecture. The backbone is adapted from DSSD [9].

Unlike the method above [4], which regresses text boxes or segments directly, we add a supplementary step that uses the invoice layout information in the image (such as form lines, the red seal and the two-dimensional code), detected by an FCN architecture [13], to constrain the detected bounding boxes and to improve the accuracy and efficiency of text detection. Combining this with the method above, we use NMS and a set of layout rules to filter out low-scoring candidate boxes and obtain the RoIs, as sketched below. The remaining bounding boxes are merged via NMS according to their textness scores and fed into the Text Recognition Network (TRN) for recognition.
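The layout constraint can be viewed as a simple geometric filter applied to candidate boxes before the final NMS. The plain-Python sketch below assumes that the layout cues have already been converted into axis-aligned cell regions; the thresholds and helper names are assumptions for illustration.

```python
def filter_boxes_by_layout(candidate_boxes, scores, layout_cells,
                           score_thresh=0.5, min_inside=0.7):
    """Keep candidate boxes that score high enough and lie mostly inside one of
    the layout cells derived from form-line / seal / QR-code cues.
    Boxes and cells are (x1, y1, x2, y2); the threshold values are illustrative."""
    def inside_ratio(box, cell):
        ix1, iy1 = max(box[0], cell[0]), max(box[1], cell[1])
        ix2, iy2 = min(box[2], cell[2]), min(box[3], cell[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = max(1e-6, (box[2] - box[0]) * (box[3] - box[1]))
        return inter / area

    kept = []
    for box, score in zip(candidate_boxes, scores):
        if score < score_thresh:
            continue  # drop low-scoring candidates
        if any(inside_ratio(box, cell) >= min_inside for cell in layout_cells):
            kept.append((box, score))
    return kept  # standard NMS is then applied to the surviving boxes
```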

3.3 Text Recognition Network

To process RoIs of different scales and aspect ratios in a unified way, most existing works re-sample regions into fixed-size feature maps via pooling [14]. However, for text, this approach may lead to significant distortion due to the large variation of word lengths. For example, it may be unreasonable to encode short words like “Dr” and long words like “congratulations” into feature maps of the same size. In this work, we propose to re-sample regions according to their respective aspect ratios, and then use RNNs to encode the resulting feature maps of different lengths into fixed length vectors. The whole region feature encoding process is illustrated in Fig. 3.

Fig. 3.

Region Features Encoder (RFE). The region features after RoI pooling are not required to be of the same size; instead, their widths are computed from the aspect ratio of each bounding box, with the height normalized. An LSTM is then employed to encode region features of different lengths into vectors of the same size.

For an RoI of size \( h \times w \), we perform spatial max-pooling with a resulting size of

$$ H \times \min \left( W_{\max}, \, 2Hw/h \right), $$
(1)

where the expected height \( H \) is fixed and the width is adjusted to keep the aspect ratio at \( 2w/h \) (twice the original aspect ratio) unless it exceeds the maximum length \( W_{\max} \). Note that here we employ a pooling window with an aspect ratio of 1:2, which benefits the recognition of narrow characters, such as ‘i’ and ‘l’, as stated in [5].
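For concreteness, a small helper that computes the pooled output size of Eq. (1); the values of \( H \) and \( W_{\max} \) below are illustrative assumptions.

```python
def pooled_size(h, w, H=8, W_max=64):
    """Output size of the varying-width RoI max-pooling in Eq. (1):
    the height is fixed to H, the width keeps roughly twice the RoI's
    aspect ratio, capped at W_max."""
    W = min(W_max, int(round(2.0 * H * w / h)))
    return H, max(1, W)

# Example: a wide 32x320 RoI -> (8, 64); a narrow 32x24 RoI -> (8, 12).
```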

Next, the resampled feature maps are treated as a sequence and fed into an RNN for encoding. Here we use Long Short-Term Memory (LSTM) [11] instead of a vanilla RNN to overcome gradient vanishing and exploding. The feature maps after the varying-size RoI pooling are denoted as \( \mathbf{Q} \in \mathbb{R}^{C \times H \times W} \), where \( W = \min(W_{\max}, 2Hw/h) \) is the number of columns and \( C \) is the channel size. We flatten the features in each column and obtain a sequence \( \mathbf{q}_1, \ldots, \mathbf{q}_W \in \mathbb{R}^{C \times H} \), which is fed into the LSTM column by column. At each step the LSTM units receive one column of features \( \mathbf{q}_t \) and update their hidden state by a non-linear function \( \mathbf{h}_t = f(\mathbf{q}_t, \mathbf{h}_{t-1}) \). In this recurrent fashion, the final hidden state \( \mathbf{h}_W \) (with size \( R = 1024 \)) captures the holistic information of \( \mathbf{Q} \) and is used as a fixed-dimensional RoI representation.
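A sketch of this column-wise encoding using the TensorFlow 1.x RNN API; the tensor layout, static shapes and the 1024-unit size follow the description above, while everything else (names, batch handling) is an assumption.

```python
import tensorflow as tf

def encode_region(feature_map, hidden_size=1024):
    """Encode pooled RoI features Q of shape [C, H, W] (static shapes assumed)
    into a fixed-length vector: flatten each column into a C*H vector and feed
    the W columns to an LSTM, keeping the final hidden state h_W."""
    C, H, W = feature_map.get_shape().as_list()
    # [C, H, W] -> [W, C, H] -> [W, C*H] -> add a batch dimension -> [1, W, C*H]
    columns = tf.reshape(tf.transpose(feature_map, [2, 0, 1]), [W, C * H])
    columns = tf.expand_dims(columns, 0)
    cell = tf.nn.rnn_cell.LSTMCell(hidden_size)
    outputs, state = tf.nn.dynamic_rnn(cell, columns, dtype=tf.float32)
    # outputs[0]: per-column hidden states (used later as V); state.h[0]: h_W
    return outputs[0], state.h[0]
```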

Text recognition aims to predict the text in the detected bounding boxes based on the extracted region features. As shown in Fig. 4, we adopt LSTMs with attention mechanism [12, 15] to decode the sequential features into words.

Fig. 4.

Text Recognition Network (TRN). The region features are encoded by one layer of LSTMs and then decoded in an attention-based sequence-to-sequence manner. The hidden states of the encoder at all time steps are preserved and used as the context for the attention model.

Firstly, the hidden states at all steps, \( \mathbf{h}_1, \ldots, \mathbf{h}_W \), from the RFE are fed into an additional LSTM encoder layer with 1024 units. We record the hidden state at each time step and form a sequence \( \mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_W] \in \mathbb{R}^{R \times W} \). It contains local information at each time step and serves as the context for the attention model.

As for the decoder LSTMs, the ground-truth word label is adopted as input during training. It can be regarded as a sequence of tokens \( \mathbf{s} = \{ s_0, s_1, \ldots, s_{T+1} \} \), where \( s_0 \) and \( s_{T+1} \) represent the special tokens START and END respectively. We feed the decoder LSTMs with \( T + 2 \) vectors \( \mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{T+1} \), where \( \mathbf{x}_0 = [\mathbf{v}_W; \text{Atten}(\mathbf{V}, \mathbf{0})] \) is the concatenation of the encoder's last hidden state \( \mathbf{v}_W \) and the attention output with guidance equal to zero, and \( \mathbf{x}_i = [\psi(s_{i-1}); \text{Atten}(\mathbf{V}, \mathbf{h}'_{i-1})] \), for \( i = 1, \ldots, T+1 \), is the concatenation of the embedding \( \psi(\cdot) \) of the \( (i-1) \)-th token \( s_{i-1} \) and the attention output guided by the decoder hidden state from the previous time step, \( \mathbf{h}'_{i-1} \). The embedding function \( \psi(\cdot) \) is defined as a linear layer followed by a tanh non-linearity.

The attention function \( \mathbf{c}_i = \text{Atten}(\mathbf{V}, \mathbf{h}'_i) \) is defined as follows:

$$ \left\{ \begin{aligned} & \mathbf{g}_j = \tanh \left( W_v \mathbf{v}_j + W_h \mathbf{h}'_i \right), \quad j = 1, \ldots, W, \\ & \boldsymbol{\alpha} = \mathrm{softmax}\left( \mathbf{w}_g^{T} [\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_W] \right), \\ & \mathbf{c}_i = \sum\nolimits_{j=1}^{W} \alpha_j \mathbf{v}_j \end{aligned} \right. $$
(2)

where \( \mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_W] \) is the variable-length sequence of features to be attended to, \( \mathbf{h}'_i \) is the guidance vector, \( W_v \) and \( W_h \) are linear embedding weights to be learned, \( \boldsymbol{\alpha} \) is the vector of attention weights of size \( W \), and \( \mathbf{c}_i \) is the weighted sum of the input features.
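Eq. (2) translates directly into NumPy; the sketch below assumes \( \mathbf{V} \) is stored as a \( (W, R) \) array with one encoder state per row, and the weight matrices are passed in explicitly.

```python
import numpy as np

def attention(V, h_prime, Wv, Wh, wg):
    """Eq. (2): additive attention.
    V: (W, R) encoder states; h_prime: (R,) guidance vector;
    Wv, Wh: (d, R) learned projections; wg: (d,) scoring vector."""
    g = np.tanh(V @ Wv.T + h_prime @ Wh.T)   # g_j for j = 1..W, shape (W, d)
    scores = g @ wg                          # w_g^T g_j, shape (W,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                     # softmax attention weights
    return alpha @ V                         # context c_i, shape (R,)
```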

At each time step \( t = 0, 1, \ldots, T+1 \), the decoder LSTMs compute their hidden state \( \mathbf{h}'_t \) and output vector \( \mathbf{y}_t \) as follows:

$$ \left\{ \begin{aligned} & \mathbf{h}'_t = f(\mathbf{x}_t, \mathbf{h}'_{t-1}), \\ & \mathbf{y}_t = \varphi(\mathbf{h}'_t) = \mathrm{softmax}(W_o \mathbf{h}'_t) \end{aligned} \right. $$
(3)

where an LSTM [11] is used as the recurrence function \( f(\cdot) \), and \( W_o \) linearly transforms hidden states to the output space, which consists of 26 case-insensitive letters, 10 digits, common standard Chinese characters, a token representing all punctuation marks such as “!” and “?”, and a special END token.
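A NumPy sketch of one decoder step (Eq. 3), including the construction of \( \mathbf{x}_t \) described above; the LSTM cell and attention function are abstracted behind generic callables, and all names are placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(prev_token_emb, V, h_prev, c_prev, lstm_step, attn, Wo):
    """One decoder step: build x_t from the previous token embedding and the
    attention context guided by h'_{t-1}, apply the LSTM recurrence f, then
    map h'_t to a distribution over the output vocabulary (Eq. 3)."""
    context = attn(V, h_prev)                        # Atten(V, h'_{t-1})
    x_t = np.concatenate([prev_token_emb, context])  # x_t = [psi(s_{t-1}); c]
    h_t, c_t = lstm_step(x_t, h_prev, c_prev)        # h'_t = f(x_t, h'_{t-1})
    y_t = softmax(Wo @ h_t)                          # y_t = softmax(W_o h'_t)
    return y_t, h_t, c_t
```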

At test time, the token with the highest probability in the previous output \( \mathbf{y}_t \) is selected as the input token at step \( t+1 \), instead of the ground-truth tokens \( s_1, \ldots, s_T \).

The process starts with the START token and is repeated until the special END token is produced.
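This greedy test-time decoding amounts to a loop that feeds the argmax token back in until END is emitted; the token ids, the step function and the maximum length below are assumptions for illustration.

```python
def greedy_decode(step_fn, embed, start_id, end_id, max_len=25):
    """Greedy decoding: start from START, feed back the most probable token at
    each step, and stop at END or after max_len steps. step_fn maps a token
    embedding and the decoder state to (output distribution, new state)."""
    token, state, result = start_id, None, []
    for _ in range(max_len):
        probs, state = step_fn(embed(token), state)
        token = int(probs.argmax())   # most probable token becomes the next input
        if token == end_id:
            break
        result.append(token)
    return result
```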

3.4 Loss Functions and Training

As demonstrated above, during training our system takes as input an image together with the word bounding boxes and their labels. For the final outputs of the whole system, we apply a multi-task loss covering both detection and recognition.

$$ L = L_{D} + L_{R} $$
(4)

Our text detection network is trained on corner detection and position-sensitive segmentation simultaneously. The loss function is defined as:

$$ L_{D} = \frac{1}{{N_{c} }}L_{conf} + \frac{{\lambda_{1} }}{{N_{c} }}L_{loc} + \frac{{\lambda_{2} }}{{N_{s} }}L_{seg} $$
(5)

where \( L_{conf} \) and \( L_{loc} \) are the losses of the score branch (which predicts confidence scores) and the offset branch (which performs localization) in the corner point detection module, and \( L_{seg} \) is the loss of position-sensitive segmentation. \( N_c \) is the number of positive default boxes and \( N_s \) is the number of pixels in the segmentation maps; they normalize the losses of corner point detection and segmentation respectively. \( \lambda_1 \) and \( \lambda_2 \) are the balancing factors of the three tasks. By default, we set \( \lambda_1 \) to 1 and \( \lambda_2 \) to 10.
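A direct scalar transcription of Eq. (5), assuming the three branch losses have already been summed over their respective elements; the default balancing factors follow the values stated above.

```python
def detection_loss(L_conf, L_loc, L_seg, N_c, N_s, lam1=1.0, lam2=10.0):
    """Eq. (5): corner confidence, corner offset and segmentation losses,
    normalized by the number of positive default boxes (N_c) and the number
    of segmentation pixels (N_s), with default balancing factors 1 and 10."""
    return L_conf / N_c + lam1 * L_loc / N_c + lam2 * L_seg / N_s
```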

We follow the text recognition training strategy proposed by Lyu et al. [4], and the loss for training text recognition is

$$ L_{R} = \frac{1}{N_{c}} \sum\nolimits_{i=1}^{N_{c}} L_{rec} \left( Y^{(i)}, s^{(i)} \right) $$
(6)

where \( s^{(i)} \) denotes the ground-truth tokens for sample \( i \) and \( Y^{(i)} = \{ \mathbf{y}_0^{(i)}, \mathbf{y}_1^{(i)}, \ldots, \mathbf{y}_{T+1}^{(i)} \} \) is the corresponding output sequence of the decoder LSTMs. \( L_{rec}(Y, s) = -\sum\nolimits_{t=1}^{T+1} \log \mathbf{y}_t(s_t) \) denotes the cross-entropy loss on \( \mathbf{y}_1, \ldots, \mathbf{y}_{T+1} \), where \( \mathbf{y}_t(s_t) \) is the predicted probability of the output being \( s_t \) at time step \( t \); the loss on \( \mathbf{y}_0 \) is ignored.
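Eq. (6) is an average over samples of the per-sequence cross-entropy; a NumPy sketch, where the layout of the prediction array is an assumption:

```python
import numpy as np

def recognition_loss(Y_list, s_list):
    """Eq. (6): average sequence cross-entropy over N_c samples.
    Y_list[i] has shape (T_i + 2, vocab_size), holding y_0 .. y_{T_i + 1};
    s_list[i] is the array of ground-truth token ids s_1 .. s_{T_i + 1}.
    The loss on y_0 is ignored, as in the text."""
    total = 0.0
    for Y, s in zip(Y_list, s_list):
        steps = np.arange(1, len(s) + 1)          # time steps t = 1 .. T+1
        picked = Y[steps, np.asarray(s)]          # y_t(s_t) for each t
        total += -np.log(picked + 1e-12).sum()
    return total / len(Y_list)
```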

4 Experiments

In this section, we perform experiments to verify the effectiveness of the proposed method. We evaluate it on VAT invoice image datasets accumulated in the enterprise financial management scenario.

Our method is implemented using TensorFlow r1.4.1. All the experiments are carried out on a workstation with an Intel Xeon 8-core CPU (2.10 GHz), 2 GeForce GTX 1080 graphics cards, and 64 GB RAM. Running on one GPU, training a batch takes about 1 s, and the whole training process takes less than a day.

For the different application scenarios of the invoice, scanned invoices and photographed invoices achieve different F-measures. Photographed invoices are easily affected by factors such as size, noise, blur, illumination, contrast and occlusion. One contribution of this work is the supplementary method that uses the invoice layout information in the image to improve the accuracy and efficiency of text detection. To validate its effectiveness, we compare the models “Ours FCN-biLSTM+NoLayout” and “Ours FCN-biLSTM+Layout”. The experiments show that the model with the constrained layout rule performs significantly better than the one without it. As illustrated in Tables 1 and 2, adopting the constrained layout rule (“Ours FCN-biLSTM+Layout”) instead of the unconstrained variant (“Ours FCN-biLSTM+NoLayout”) increases the F-measure by around 4%.

Table 1. Results on the scanned invoice image datasets. Precision (P) and Recall (R) at maximum F-measure (F) are reported in percentage.
Table 2. Results on the photo invoice image datasets. Precision (P) and Recall (R) at maximum F-measure (F) are reported in percentage.

5 Conclusion

In this paper, we have presented an automatic value-added tax (VAT) invoice recognition and processing system, in which VAT invoices can be detected and recognized efficiently and accurately in a single forward pass. Experimental results show that the proposed method achieves impressive performance in actual enterprise projects, and that the model with the constrained layout rule performs significantly better than the one without it. A potential direction for future work is extending the system to images of other bills and documents.