Abstract
This paper studies the problem of sequential visual processing to solve arithmetic operations using handwritten digits. We feed a sequence of digits with an arithmetic operator to a trained system, and then ask for the resulting symbolic answer. All digits and operators in the input sequence are images, while the output is a real number rounded up. The proposed architecture is a hybrid recurrent-convolutional network with a regression module that is trainable end-to-end. The experimental results show that the proposed architecture is able to add or subtract sequences of up to five elements with high accuracy, and that long sequences require long training times.
1 Introduction
Our capacity to recognize numbers and to do arithmetic with them is remarkable. We use it every day to calculate bills, check available cash, estimate times and dates, and carry out many other activities. Although machines are more precise and faster than humans at operating on numbers, we can recognize numbers visually in a variety of forms and apply abstract concepts to them. An example of this is the ability to perform arithmetic operations that involve the visual interpretation of symbol sequences.
In this work, we study the problem of sequential processing of visual signals. In particular, we study the problem of computing arithmetic operations on sequences of handwritten digits, as a proxy task to understand the difficulty of sequential-visual analysis tasks. At the core of these problems is the ability to recognize a set of objects, analyze their spatial arrangement, and apply some reasoning to make a decision.
We propose and evaluate a deep learning architecture to solve arithmetic operations from visual information. The input to this architecture is a sequence of images with handwritten digits and handwritten symbols of addition or subtraction. The output is a single decimal number representing the result of the depicted operation. We evaluate the proposed architecture using sequences of different lengths to assess its robustness for long-term recognition. Our results indicate that the architecture can predict the correct output for sequences of length 3 and 5, but longer sequences are harder to train in a reasonable amount of time.
2 Previous Work
Image classification and object recognition are vision problems that have experienced unprecedented progress during the last few years thanks to the resurgence of convolutional networks [7]. Originally proposed during the late 80s [8], convolutional networks are a differentiable model that makes multiple non-linear transformations to an image to extract relevant features for a particular task.
Recurrent Neural Networks (RNNs) are a family of models that have shown excellent results for modeling sequential data. They are being actively investigated for language modeling [9], and have also been incorporated into vision systems to predict sentences [6], segment images [12], and analyze video [1]. One of the most interesting properties of RNN models such as the Long Short-Term Memory (LSTM) [3] is that they can learn to remember long-term dependencies in sequential data. In this way, a single observation made at time t may change the whole interpretation of the input sequence or signal. In our work, we evaluate the capacity of LSTM models to understand the order of digits in a sequence and produce correct interpretations associated with the corresponding arithmetic operation.
Digit classification is one of the most widely studied problems in the machine learning community [11]. Other related studies include image generation [2], visual attention models [10], and spatial transformations of objects [5]. In this context, the closest study to our work is the one proposed in [4], where the authors use a deep neural network (DNN) to process two input images, each showing a 7-digit number, in order to produce an image displaying the result of an arithmetic operation. Unlike their model, our proposal works on sequences of varying length composed of images of handwritten digits and arithmetic symbols, rather than on images of fixed-length numbers made of digits rendered electronically in a standard font. However, the two models are not directly comparable, since their experimental conditions (input, output and objective function) are fundamentally different.
3 Handwritten Arithmetics with Deep Learning
3.1 Problem Description
Assume two participants involved in the task of solving an arithmetic operation. The first participant writes the operation by hand using a pen, while the second participant observes from left to right each character of the operation one at a time, and computes the result. The second participant has to remember important information of the sequence to interpret the numbers and the operation correctly in order to compute the result. In our setup the first participant is a computer program that generates arithmetic operations and feeds them as sequences to the second participant which is the proposed algorithm.
An easy way to model this problem with high accuracy would be to use a handwritten digit classifier and a module engineered to keep the prediction results and solve the operation deterministically. However, we want to study the properties of a system that has to learn the process of remembering, interpreting the input, and associating the correct output. The design of such a system is presented in the following subsections.
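For contrast, the engineered baseline mentioned above can be sketched in a few lines. This is only an illustration and is not part of the proposed system: it assumes that a separate classifier has already turned every image into a symbol (a digit or an operator), and the function name `evaluate_symbols` is hypothetical.

```python
def evaluate_symbols(symbols):
    """Deterministically evaluate a left-to-right sequence of recognized symbols.

    `symbols` is assumed to be the output of a digit/operator classifier,
    e.g. [1, 2, "+", 3, 4] for the handwritten operation "12 + 34".
    """
    operators = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}
    # Locate the single operator that separates the two operands.
    op_index = next(i for i, s in enumerate(symbols) if s in operators)
    # Concatenate the digits on each side into integer operands.
    left = int("".join(str(d) for d in symbols[:op_index]))
    right = int("".join(str(d) for d in symbols[op_index + 1:]))
    return operators[symbols[op_index]](left, right)


result = evaluate_symbols([1, 2, "+", 3, 4])
```

Here all of the sequential reasoning is written by hand, which is exactly the part that the learned model has to discover on its own.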
3.2 Deep Learning Architecture
We propose a deep learning architecture with three main components: a convolutional network, a recurrent network, and a regressor. The convolutional network takes as input images of handwritten digits and symbols and extracts from them visual features. The recurrent network (RNN) reads these visual features in order to recognize the most important information and to keep a compact representation of the entire sequence. Finally, the regressor reads and interprets the last representation of the sequence to produce a real number as result. Figure 1 depicts the proposed architecture.
Convolutional Network. The convnet is a function that builds a feature embedding for an image. We use a three-layer convnet that takes inputs of \(28\times 28\) pixels with a single gray-scale channel. The first two layers are convolutional: each transforms its input into a feature map and then applies the ReLU non-linearity (\(R(x)=\max (0,x)\)) and a \(2\times 2\) max-pooling operation. The first layer has 32 filters of \(5\times 5\) pixels and the second layer has 64 filters, also of \(5\times 5\) pixels. The last layer is fully connected and embeds the output into a 1,024-dimensional feature vector.
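The feature-map sizes implied by this description can be traced with simple arithmetic. The sketch below is illustrative only: the text does not state the padding or the stride, so it assumes stride 1 and "same" padding by default, and it also reports the alternative without padding. All function names are hypothetical.

```python
def conv_output_size(size, kernel, padding, stride=1):
    """Spatial size after a convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def pooled_size(size, window=2):
    """Spatial size after non-overlapping max-pooling."""
    return size // window


def trace_feature_shapes(input_size=28, kernel=5, filters=(32, 64), same_padding=True):
    """Return (height, width, channels) after each conv + ReLU + pool stage."""
    # ASSUMPTION: padding is not given in the text; "same" padding keeps the size.
    padding = kernel // 2 if same_padding else 0
    shapes = []
    size = input_size
    for n_filters in filters:
        size = conv_output_size(size, kernel, padding)
        size = pooled_size(size)
        shapes.append((size, size, n_filters))
    return shapes


shapes = trace_feature_shapes()
# Number of values fed to the fully connected layer that outputs 1,024 features.
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
```

Under the "same" padding assumption the fully connected layer receives a \(7\times 7\times 64\) map; without padding it would receive a \(4\times 4\times 64\) map.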
Recurrent Network. An RNN models a dynamic system whose inputs vary with time and can take a sequence \(X=\{\phi (x_1), \phi (x_2), \ldots , \phi (x_T)\}\) of arbitrary length T. The goal of an RNN is to progressively encode information about the sequence in a vector representation called the hidden state. Each element of X is processed sequentially, and the output is the new hidden state \(h_t\), which depends on the previous state \(h_{t-1}\) and the current observation \(\phi (x_t)\). A simple RNN function can be modeled as
\[ h_t = \tanh (W^r h_{t-1} + U \phi (x_t)) \]
where \(W^r\) and U are linear transformations of the inputs, and \(\tanh \) is an element-wise nonlinear activation function. Simple RNN models are difficult to train when input sequences are long because the gradient tends to vanish through time. However, alternative formulations of RNN functions have been proposed, which have trainable memory mechanisms for dealing with longer sequences. In particular, we use the Long Short-Term Memory (LSTM) unit.
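As a minimal sketch of the simple recurrence, assuming the common form \(h_t = \tanh (W^r h_{t-1} + U \phi (x_t))\) with a zero initial state and no bias term, the following pure-Python code encodes a short sequence. The tiny matrices and feature vectors are made-up values chosen only to keep the example readable.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(h_prev, x_feat, W_r, U):
    """One step of the simple RNN: h_t = tanh(W_r h_prev + U x_feat)."""
    recurrent = matvec(W_r, h_prev)
    observed = matvec(U, x_feat)
    return [math.tanh(r + o) for r, o in zip(recurrent, observed)]


def encode_sequence(features, W_r, U, hidden_size):
    """Fold a whole sequence of feature vectors into the last hidden state."""
    h = [0.0] * hidden_size  # ASSUMPTION: zero initial hidden state
    for x_feat in features:
        h = rnn_step(h, x_feat, W_r, U)
    return h


# Toy sizes: 3-dimensional features, 2-dimensional hidden state.
W_r = [[0.5, 0.0], [0.0, 0.5]]
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
features = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
h_T = encode_sequence(features, W_r, U, hidden_size=2)
```

The same loop works for any sequence length, which is why a recurrent model can read operations with operands of different sizes.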
An LSTM unit has a memory vector c in addition to the hidden state h, and it implements gates that allow reading, writing and resetting information in the memory vector. The memory and hidden state vectors are formulated as
\[ c_t = f \odot c_{t-1} + i \odot g, \qquad h_t = o \odot \tanh (c_t) \]
where \(c_t\) is the memory content at time t, i and f are functions that control writing and resetting the memory content, g is the transformed input, and o is a function that controls the output to the hidden state. Notice that the writing (i), resetting (f), and reading (o) gates have element-wise multiplicative interaction (\(\odot \)) with the information in the memory vector.
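The role of the gates can be illustrated with the state update alone. The sketch below assumes the standard LSTM update, \(c_t = f \odot c_{t-1} + i \odot g\) and \(h_t = o \odot \tanh (c_t)\), and takes the gate activations as given inputs rather than computing them from learned weights; the numeric values are made up.

```python
import math


def lstm_state_update(c_prev, i, f, o, g):
    """Element-wise LSTM memory and hidden-state update.

    i, f, o are gate activations in [0, 1]; g is the transformed input.
    ASSUMPTION: standard LSTM update; the gates are supplied by the caller.
    """
    # Reset (f) scales the old memory, write (i) scales the new candidate.
    c_t = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # Read (o) controls how much of the memory reaches the hidden state.
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return c_t, h_t


c_prev = [1.0, -2.0]
# Forget gate open, write gate closed: the memory is carried over unchanged.
c_keep, h_keep = lstm_state_update(
    c_prev, i=[0.0, 0.0], f=[1.0, 1.0], o=[1.0, 1.0], g=[0.7, 0.7])
# Forget gate closed, write gate open, read gate closed: the memory is
# overwritten, but nothing is exposed to the hidden state.
c_new, h_new = lstm_state_update(
    c_prev, i=[1.0, 1.0], f=[0.0, 0.0], o=[0.0, 0.0], g=[0.7, 0.7])
```

The first call shows how a digit seen early in the sequence can survive many steps, and the second shows how the unit can replace its content when a new symbol matters more.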
We adopt the LSTM architecture as the recurrent network in our model, using a hidden state \(h\in \mathbb {R}^{512}\), and input vectors \(\phi (x)\in \mathbb {R}^{1024}\) produced by the convnet. The LSTM architecture is useful for processing sequences of handwritten digits, because its memory structure allows the model to preserve information about the order in which digits appear in the sequence, as well as the position of the arithmetic operator. Getting this information right in the final representation is crucial to produce a correct interpretation of the operation.
Regression Network. The final component of our architecture is a network that reads the last state of the recurrent network (\(h_T\)) and computes the result of the arithmetic operation. We use a two-layer, fully connected network to transform \(h_T\) into a single real number. This network is supervised with respect to the true result of an arithmetic operation using the regression loss function
\[ L(y, \hat{y}) = \begin{cases} 0.5\,(y-\hat{y})^2 & \text{if } |y-\hat{y}| < 1 \\ |y-\hat{y}| - 0.5 & \text{otherwise} \end{cases} \]
where y is the ground truth result and \(\hat{y}\) is the output of the regression network. This loss function is also known as smooth-L1 loss, and has very stable gradients for regression problems of arbitrary output. It has constant gradient if the absolute value of the error is greater than 1, and linear gradient when the error is smaller than 1. It works well in our arithmetic operations problem, because long sequences usually result in outputs with large values.
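The behavior of the gradient described above can be checked with a small sketch of the smooth-L1 loss in its usual form, with the transition at an absolute error of 1. The function names and the sample values are illustrative.

```python
def smooth_l1(prediction, target):
    """Smooth-L1 loss: quadratic for small errors, linear for large ones."""
    error = abs(prediction - target)
    if error < 1.0:
        return 0.5 * error ** 2
    return error - 0.5


def smooth_l1_grad(prediction, target):
    """Derivative of the loss with respect to the prediction."""
    diff = prediction - target
    if abs(diff) < 1.0:
        return diff  # linear gradient near the target
    return 1.0 if diff > 0 else -1.0  # constant magnitude far from the target


small_loss = smooth_l1(10.0, 10.5)        # error of 0.5, quadratic region
large_loss = smooth_l1(100.0, 90.0)       # error of 10, linear region
large_grad = smooth_l1_grad(100.0, 90.0)  # bounded even for a big error
```

Because the gradient magnitude never exceeds 1, a prediction that is off by hundreds, as can happen with three-digit operands, does not produce a disproportionately large update.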
4 Experiments and Results
We evaluated the proposed architecture with three setups: only additions, only subtractions, and a mix of both. In all cases, we conducted experiments with sequences involving two operands and one operator. Both operands are natural numbers with the same number of digits. We considered operands with one, two and three digits each.
4.1 Dataset and Sequences
To generate sequences for feeding the proposed architecture we use the MNIST handwritten digit database, which is composed of 55,000 training examples and 10,000 testing examples. This dataset was extended with a set of arithmetic symbol images, of which 144 are for training and 36 for testing. Each dataset example is a gray-scale image of \(28\times 28\) pixels.
We generate sequences of arithmetic operations as follows: the operator is chosen from the set of symbols depending on the model that we are training, and the operands are constructed digit by digit. At each digit position of the sequence, we sample a random digit image from the MNIST database with uniform probability. The corresponding ground truth is also calculated. Both training and testing sequences were generated following this procedure, but using separate sets of data.
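The generation procedure can be sketched as follows. This is a hedged illustration rather than the authors' code: the pools of (image identifier, label) pairs stand in for the MNIST digits and the handwritten operator images, and the names `make_pool` and `generate_sequence` are hypothetical.

```python
import random


def make_pool(n_per_class=5):
    """Hypothetical stand-in for MNIST: (image_id, digit_label) pairs."""
    return [(f"img_{label}_{k}", label)
            for label in range(10) for k in range(n_per_class)]


def generate_sequence(digit_pool, operator_pool, digits_per_operand, rng):
    """Build one operation as a list of images plus its ground-truth result."""
    def sample_operand():
        # Each digit position is drawn uniformly from the pool of images.
        picks = [rng.choice(digit_pool) for _ in range(digits_per_operand)]
        value = int("".join(str(label) for _, label in picks))
        return [image for image, _ in picks], value

    left_images, left_value = sample_operand()
    op_image, op_symbol = rng.choice(operator_pool)
    right_images, right_value = sample_operand()
    if op_symbol == "+":
        target = left_value + right_value
    else:
        target = left_value - right_value
    return left_images + [op_image] + right_images, float(target)


digit_pool = make_pool()
operator_pool = [("op_plus_0", "+"), ("op_minus_0", "-")]
images, target = generate_sequence(digit_pool, operator_pool, 2, random.Random(0))
```

Restricting `operator_pool` to a single symbol gives the addition-only or subtraction-only setups, and using disjoint pools for training and testing mirrors the separation of data described above.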
4.2 Training
The CNN was pre-trained to classify each of the 10 digits with a softmax classification layer. This layer was connected to the last fully connected layer of the proposed architecture, which has 1,024 features. The softmax classifier was then removed, and the weights of the network were used as initialization values. The full architecture was trained end-to-end using the Adam optimization algorithm with a learning rate of 0.0001. Training was run until achieving a training accuracy of at least 95%. We used mini-batches of 64 sequences, each sequence with 3 to 7 images. The hyper-parameters of the network were cross-validated until a stable configuration was found.
Table 1 presents an overview of the training results for the considered models. Notice how longer sequences need longer training times to reach a useful solution. This is explained by the large number of examples that an operation can have as we add more digits to the operands. The combination of digits leads to an exponential growth in the number of potential operations to consider, which makes the model harder to train for long sequences.
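The growth mentioned above can be made concrete by counting distinct symbolic operations. The sketch assumes that every digit position can take any of the 10 digits independently, so leading zeros are allowed, and it ignores the much larger variety that comes from different handwritten images of the same digit.

```python
def symbolic_operations(digits_per_operand, n_operators=1):
    """Number of distinct operations with two operands of the given size."""
    # Two operands, each with 10 ** digits_per_operand possible values.
    return n_operators * 10 ** (2 * digits_per_operand)


def sequence_length(digits_per_operand):
    """Images per sequence: two operands plus one operator."""
    return 2 * digits_per_operand + 1


# Maps sequence length to the number of single-operator operations.
table = {sequence_length(n): symbolic_operations(n) for n in (1, 2, 3)}
```

Each additional digit per operand multiplies the number of operations by 100, so sequences of length 7 span ten thousand times more cases than sequences of length 3.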
Figure 2 shows the performance of the addition and subtraction model for the three considered sequence lengths; similar behavior was observed for the addition-only and subtraction-only models. The first thing to notice is that the mini-batch error decreases as the training accuracy increases over the iterations. It can also be observed that, for longer sequences, the training error is higher during the whole training session. This is due to the range of the regression output, as well as the difficulty of training on longer sequences. Interestingly, subtraction reaches its best performance in fewer iterations than addition, and the model that computes both addition and subtraction behaves similarly.
4.3 Testing
In general, all tested models are relatively successful, as shown in Table 2. Accuracy was computed by rounding up the predictions and measuring the percentage of them that match the ground truth. For sequences of length 3 we reach an accuracy above 97%, for sequences of length 5 more than 90%, and for sequences of length 7 between 75% and 85%. Longer sequences are more difficult to interpret, possibly due to the lack of robustness of the regression network.
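The accuracy measure can be sketched as below. The text does not specify the rounding rule, so this example assumes rounding to the nearest integer (halves rounded upward); the predictions and targets are made-up values, not results from the paper.

```python
import math


def round_prediction(value):
    """Round a real-valued prediction to an integer.

    ASSUMPTION: nearest integer with halves rounded upward; the exact rule
    used for the reported results is not stated in the text.
    """
    return math.floor(value + 0.5)


def accuracy(predictions, targets):
    """Percentage of rounded predictions that equal the ground truth."""
    matches = sum(1 for p, t in zip(predictions, targets)
                  if round_prediction(p) == t)
    return 100.0 * matches / len(targets)


# Made-up regression outputs and their ground-truth results.
predictions = [45.8, -3.2, 120.6, 7.4]
targets = [46, -3, 120, 7]
acc = accuracy(predictions, targets)
```

In this toy case the third prediction rounds to 121 instead of 120, which shows how an error of a few decimals is enough to count as a miss.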
Among all three models, subtraction exhibits the best accuracy. The testing accuracy of the addition-subtraction model is slightly lower than the testing accuracy for subtraction only. This can be attributed to the introduction of addition operations, whose symbol seems to be harder to identify than the subtraction symbol.
Table 3 depicts some sample test sequences. The failure cases for short sequences are typically off from the ground truth by only a few decimals. For longer sequences, however, the prediction can be off by a large margin. Our goal was never to achieve 100% accuracy on the output prediction, but rather to study the problem of visual sequence analysis in the context of RNNs. Even so, the proposed system achieved good performance, which motivates further studies in this direction on problems where long-term dependencies in visual sequential data need to be taken into account.
5 Conclusions and Future Work
This work presented and evaluated an architecture to compute arithmetic operations from sequences of handwritten digits. The model is very successful at predicting the correct output for sequences with operands of one or two digits each. However, the model becomes harder to train for longer sequences, reaching a moderately successful result for operands with three digits. The proposed architecture has also shown the effectiveness of RNN models for visual sequence analysis, in particular the capability of LSTM units to memorize and summarize sequential information. In our future work, we aim to explore more efficient training strategies to optimize the exploration of combinations of operations of a certain length. We also plan to experiment with other operators in arbitrary positions of the sequence.
References
1. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634 (2015)
2. Gregor, K., Danihelka, I., Graves, A., Wierstra, D.: Draw: a recurrent neural network for image generation. arXiv preprint arXiv:1502.04623 (2015)
3. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
4. Hoshen, Y., Peleg, S.: Visual learning of arithmetic operations. arXiv preprint arXiv:1506.02264 (2015)
5. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems, pp. 2008–2016 (2015)
6. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015)
7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
8. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)
9. Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., Khudanpur, S.: Recurrent neural network based language model. In: INTERSPEECH, vol. 2, p. 3 (2010)
10. Mnih, V., Heess, N., Graves, A., et al.: Recurrent models of visual attention. In: Advances in Neural Information Processing Systems, pp. 2204–2212 (2014)
11. Sermanet, P., Chintala, S., LeCun, Y.: Convolutional neural networks applied to house numbers digit classification. In: 2012 21st International Conference on Pattern Recognition (ICPR), pp. 3288–3291. IEEE (2012)
12. Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.: Conditional random fields as recurrent neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1529–1537 (2015)
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Pérez, A., Quevedo, A., Caicedo, J.C. (2017). Computing Arithmetic Operations on Sequences of Handwritten Digits. In: Beltrán-Castañón, C., Nyström, I., Famili, F. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2016. Lecture Notes in Computer Science(), vol 10125. Springer, Cham. https://doi.org/10.1007/978-3-319-52277-7_48
DOI: https://doi.org/10.1007/978-3-319-52277-7_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52276-0
Online ISBN: 978-3-319-52277-7
eBook Packages: Computer Science, Computer Science (R0)