
1 Introduction

Our capacity to recognize numbers and to do arithmetic with them is remarkable. We use it every day to calculate bills, check available cash, estimate times and dates, and in many other activities. Although machines are more precise and faster than humans at operating on numbers, we can recognize numbers visually in a wide variety of forms and apply abstract concepts to them. An example of this is the ability to perform arithmetic operations that involve the visual interpretation of symbol sequences.

In this work, we study the problem of sequential processing of visual signals. In particular, we study the computation of arithmetic operations on sequences of handwritten digits as a proxy task for understanding the difficulty of sequential visual analysis. At the core of such problems is the ability to recognize a set of objects, analyze their spatial arrangement, and apply some reasoning to make a decision.

We propose and evaluate a deep learning architecture to solve arithmetic operations from visual information. The input to this architecture is a sequence of images with handwritten digits and handwritten symbols of addition or subtraction. The output is a single decimal number representing the result of the depicted operation. We evaluate the proposed architecture using sequences of different lengths to assess its robustness for long-term recognition. Our results indicate that the architecture can predict the correct output for sequences of length 3 and 5, but longer sequences are harder to train in a reasonable amount of time.

2 Previous Work

Image classification and object recognition are vision problems that have experienced unprecedented progress during the last few years thanks to the resurgence of convolutional networks [7]. Originally proposed during the late 80s [8], convolutional networks are differentiable models that apply multiple non-linear transformations to an image to extract features relevant to a particular task.

Recurrent Neural Networks (RNNs) are a family of models that have shown excellent results for modeling sequential data. They are being actively investigated for language modeling [9], and have also been incorporated into vision systems to predict sentences [6], segment images [12], and analyze video [1]. One of the most interesting properties of RNN models such as the Long Short-Term Memory (LSTM) [3] is that they can learn to remember long-term dependencies in sequential data. In this way, a single observation made at time t may change the whole interpretation of the input sequence or signal. In our work, we evaluate the capacity of LSTM models to understand the order of digits in a sequence and produce correct interpretations associated with the corresponding arithmetic operation.

Digit classification is one of the most widely studied problems in the machine learning community [11]; other related studies include image generation [2], visual attention models [10], and spatial transformations of objects [5]. In this context, the closest study to our work is the one proposed in [4], where the authors use a deep neural network (DNN) to process two input images, each showing a 7-digit number, in order to produce an image displaying the result of an arithmetic operation. Unlike their model, our proposal works on sequences of varying length composed of images of handwritten digits and arithmetic symbols, rather than images of fixed-length numbers rendered in a standard electronic font. The two models are nevertheless not directly comparable, since their experimental conditions (input, output, and objective function) are fundamentally different.

3 Handwritten Arithmetics with Deep Learning

3.1 Problem Description

Assume two participants involved in the task of solving an arithmetic operation. The first participant writes the operation by hand using a pen, while the second observes each character of the operation one at a time, from left to right, and computes the result. The second participant has to remember the important information in the sequence to interpret the numbers and the operation correctly and compute the result. In our setup, the first participant is a computer program that generates arithmetic operations and feeds them as sequences to the second participant, which is the proposed algorithm.

An easy way to model this problem with high accuracy would be to combine a handwritten digit classifier with a module engineered to collect the predictions and solve the operation deterministically. However, we want to study the properties of a system that has to learn the process of remembering, interpreting the input, and associating the correct output. The design of such a system is presented in the following subsections.
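For contrast, the following is a minimal sketch of such an engineered pipeline; classify_image is a hypothetical pre-trained character classifier, not part of the proposed model.

```python
def solve_deterministically(images, classify_image):
    """Engineered baseline: recognize each character, then evaluate symbolically.

    `classify_image` is a hypothetical pre-trained classifier that maps a
    28x28 image to one of the tokens '0'-'9', '+', or '-'.
    """
    expression = "".join(classify_image(img) for img in images)  # e.g. "42+17"
    op = "+" if "+" in expression else "-"
    left, right = expression.split(op)
    return int(left) + int(right) if op == "+" else int(left) - int(right)
```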

3.2 Deep Learning Architecture

We propose a deep learning architecture with three main components: a convolutional network, a recurrent network, and a regressor. The convolutional network takes as input images of handwritten digits and symbols and extracts visual features from them. The recurrent network (RNN) reads these visual features, recognizes the most important information, and keeps a compact representation of the entire sequence. Finally, the regressor reads and interprets the last representation of the sequence to produce a real number as the result. Figure 1 depicts the proposed architecture.

Fig. 1. Architecture for visual arithmetics. The inputs \(x_i\) are images of handwritten digits and symbols, which are processed independently by a convolutional network. The feature representation \(\phi (x_i)\) of each digit is processed by an LSTM that encodes a latent state \(h_i\) sequentially. The last state of the recurrent network is read by the regression network to produce an output.

Convolutional Network. The convnet is a function that builds a feature embedding for an image. We use a three-layer convnet that takes inputs of \(28\times 28\) pixels with a single gray-scale channel. The first two layers are convolutional: each transforms its input into a feature map, applies the ReLU non-linearity (\(R(x)=\max (0,x)\)), and is followed by a \(2\times 2\) max-pooling operation. The first layer has 32 filters of \(5\times 5\) pixels and the second layer has 64 filters, also of \(5\times 5\) pixels. Finally, a fully connected layer embeds the output into a 1,024-dimensional feature vector.
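As a concrete reference, below is a minimal PyTorch sketch of a convnet with this shape. The paper does not name a framework, and the padding choice (and hence the \(64\times 7\times 7\) flattened size) is an assumption.

```python
import torch
import torch.nn as nn

class ConvNet(nn.Module):
    """Three-layer convnet: two conv blocks plus a fully connected embedding.
    Padding is an assumption; the paper only gives filter counts and sizes."""
    def __init__(self, embedding_dim=1024):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),   # 28x28 -> 28x28, 32 maps
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 14x14
            nn.Conv2d(32, 64, kernel_size=5, padding=2),  # -> 14x14, 64 maps
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 7x7
        )
        self.fc = nn.Linear(64 * 7 * 7, embedding_dim)

    def forward(self, x):                 # x: (batch, 1, 28, 28)
        h = self.features(x)
        return self.fc(h.flatten(1))      # (batch, 1024)
```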

Recurrent Network. An RNN models a dynamic system whose inputs vary with time and can take a sequence \(X=\{\phi (x_1), \phi (x_2), \ldots , \phi (x_T)\}\) of arbitrary length T. The goal of an RNN is to progressively encode information of the sequence in a vector representation called hidden state. Each element of X is processed sequentially, and the output is the new hidden state \(h_t\) that depends on the previous state \(h_{t-1}\) and the current observation \(\phi (x_t)\). A simple RNN function can be modeled as

$$\begin{aligned} h_t(\phi (x_t), h_{t-1}) = \tanh \left( W^r \phi (x_t) + Uh_{t-1} \right) , \end{aligned}$$
(1)

where \(W^r\) and U are linear transformations of the inputs, and \(\tanh \) is an element-wise nonlinear activation function. Simple RNN models are difficult to train when input sequences are long because the gradient tends to vanish through time. However, alternative RNN formulations with trainable memory mechanisms have been proposed to deal with longer sequences. In particular, we use the Long Short-Term Memory (LSTM) unit.
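Equation (1) corresponds to a single step of a plain tanh RNN. A minimal sketch of that update follows, with weight shapes chosen here as an assumption to match the dimensions used later.

```python
import torch

def rnn_step(phi_x, h_prev, W_r, U):
    """One step of the simple RNN in Eq. (1): h_t = tanh(W^r phi(x_t) + U h_{t-1}).

    phi_x:  (batch, input_dim)   convnet feature of the current image
    h_prev: (batch, hidden_dim)  previous hidden state
    W_r:    (hidden_dim, input_dim), U: (hidden_dim, hidden_dim)
    """
    return torch.tanh(phi_x @ W_r.T + h_prev @ U.T)
```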

An LSTM unit has a memory vector c in addition to the hidden state h, and it implements gates that allow reading, writing and resetting information into the memory vector. The memory and hidden state vectors are formulated as

$$\begin{aligned} c_t(\phi (x_t), c_{t-1}) = f \odot c_{t-1} + i \odot g, \qquad h_t(\phi (x_t), h_{t-1}) = o \odot \tanh (c_t), \end{aligned}$$
(2)

where \(c_t\) is the memory content at time t, i and f are functions that control writing and resetting the memory content, g is the transformed input, and o is a function that controls the output to the hidden state. Notice that the writing (i), resetting (f), and reading (o) gates have element-wise multiplicative interaction (\(\odot \)) with the information in the memory vector.

We adopt the LSTM architecture as the recurrent network in our model, using a hidden state \(h\in \mathbb {R}^{512}\) and input vectors \(\phi (x)\in \mathbb {R}^{1024}\) produced by the convnet. The LSTM architecture is useful for processing sequences of handwritten digits because its memory structure allows the model to preserve information about the order in which digits appear in the sequence, as well as the position of the arithmetic operator. Getting this information right in the final representation is crucial for a correct interpretation of the operation.
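In a framework such as PyTorch this component can be written as a standard LSTM layer consuming the 1,024-dimensional convnet features and keeping a 512-dimensional hidden state; the sketch below is an assumption about the implementation, not the authors' code.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1024, hidden_size=512, batch_first=True)

# features: (batch, T, 1024) convnet embeddings of the T images in each sequence
features = torch.randn(4, 5, 1024)
outputs, (h_n, c_n) = lstm(features)
h_T = h_n[-1]          # (batch, 512), the last hidden state fed to the regressor
```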

Regression Network. The final component of our architecture is a network that reads the last state of the recurrent network (\(h_T\)) and computes the result of the arithmetic operation. We use a two-layer, fully connected network to transform \(h_T\) into a single real number. This network is supervised with respect to the true result of an arithmetic operation using the regression loss function

$$\begin{aligned} \mathcal {L}(y,\hat{y}) = \left\{ \begin{array}{ll} \left| y - \hat{y} \right| - 0.5 &{} \text {if}~ \left| y - \hat{y} \right| > 1 \\ 0.5\,(y-\hat{y})^2 &{} \text {otherwise} \end{array} \right. , \end{aligned}$$
(3)

where y is the ground truth result. This loss function, also known as the smooth-L1 loss, has very stable gradients for regression problems with outputs of arbitrary magnitude: the gradient is constant when the absolute error is greater than 1, and linear in the error when it is smaller than 1. It works well in our arithmetic operations problem, because long sequences usually produce outputs with large values.
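Equation (3) coincides with the standard smooth-L1 (Huber) loss with a threshold of 1, so off-the-shelf implementations apply. A sketch of the regression head and its loss follows; the hidden width of the two-layer regressor is not reported in the paper and is assumed here.

```python
import torch
import torch.nn as nn

regressor = nn.Sequential(
    nn.Linear(512, 256),   # hidden width 256 is an assumption
    nn.ReLU(),
    nn.Linear(256, 1),     # single real-valued output
)

loss_fn = nn.SmoothL1Loss()          # matches Eq. (3)

h_T = torch.randn(4, 512)            # last LSTM state for a batch of 4 sequences
y = torch.tensor([[13.0], [-7.0], [121.0], [56.0]])
y_hat = regressor(h_T)
loss = loss_fn(y_hat, y)
```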

4 Experiments and Results

We evaluated the proposed architecture in three setups: only additions, only subtractions, and a mix of both. In all cases, we conducted experiments with sequences involving two operands and one operator. Both operands are natural numbers with the same number of digits; we considered operands with one, two, and three digits each.

4.1 Dataset and Sequences

To generate sequences for the proposed architecture we use the MNIST handwritten digit database, which is composed of 55,000 training examples and 10,000 testing examples. We extended this dataset with a set of arithmetic symbol images, of which 144 are for training and 36 for testing. Each example is a \(28\times 28\) pixel gray-scale image.

We generate sequences of arithmetic operations as follows: the operator is chosen from the set of symbols depending on the model being trained, and the operands are constructed digit by digit. At each position of the sequence, we sample a random digit image from the MNIST database with uniform probability, and the corresponding ground-truth result is computed. Both training and testing sequences were generated following this procedure, but using the separate data splits.
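A sketch of this generation procedure is shown below; the helpers sample_digit_image and sample_symbol_image are hypothetical and stand for uniform sampling from the corresponding MNIST or symbol split.

```python
import random

def make_sequence(n_digits, sample_digit_image, sample_symbol_image,
                  operators=("+", "-")):
    """Build one operation with two n-digit operands and its ground truth.

    `sample_digit_image(d)` returns a random image of digit d and
    `sample_symbol_image(op)` a random image of the operator; both are
    hypothetical helpers over the training or test split.
    """
    op = random.choice(operators)
    a_digits = [random.randint(0, 9) for _ in range(n_digits)]  # uniform per position
    b_digits = [random.randint(0, 9) for _ in range(n_digits)]
    images = ([sample_digit_image(d) for d in a_digits]
              + [sample_symbol_image(op)]
              + [sample_digit_image(d) for d in b_digits])
    a = int("".join(map(str, a_digits)))
    b = int("".join(map(str, b_digits)))
    target = a + b if op == "+" else a - b
    return images, target
```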

4.2 Training

The CNN was pre-trained to classify the 10 digits with a softmax classification layer connected to its last fully connected layer of 1,024 features. The softmax classifier was then removed, and the resulting weights were used to initialize the network. The full architecture was trained end-to-end using the Adam optimization algorithm with a learning rate of 0.0001. Training was run until reaching a training accuracy of at least 95%. We used mini-batches of 64 sequences, each sequence containing 3 to 7 images. The hyper-parameters of the network were cross-validated until a stable configuration was found.
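Putting the pieces together, the following is a sketch of one training step under these settings, reusing the convnet, lstm, regressor, and loss_fn objects from the earlier sketches (data loading and device handling omitted).

```python
import torch

optimizer = torch.optim.Adam(
    list(convnet.parameters()) + list(lstm.parameters()) + list(regressor.parameters()),
    lr=1e-4,
)

def train_step(batch_images, batch_targets):
    """One mini-batch update.
    batch_images:  (64, T, 1, 28, 28) sequences of 3 to 7 images
    batch_targets: (64, 1) ground-truth results of the operations
    """
    B, T = batch_images.shape[:2]
    feats = convnet(batch_images.view(B * T, 1, 28, 28)).view(B, T, -1)  # (64, T, 1024)
    _, (h_n, _) = lstm(feats)          # encode the whole sequence
    y_hat = regressor(h_n[-1])         # regress from the last hidden state
    loss = loss_fn(y_hat, batch_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```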

Table 1 presents an overview of the training results for the considered models. Notice that longer sequences need longer training times to reach a useful solution. This is explained by the large number of examples that an operation can have as more digits are added to the operands: the number of potential operations grows exponentially with operand length (for instance, there are 90 × 90 possible two-digit operand pairs per operator, but 900 × 900 three-digit pairs), making it harder to train the model for long sequences.

Table 1. Training configuration of the three trained models

Figure 2 shows the performance of the addition-and-subtraction model for the three considered sequence lengths; a similar behavior was observed for the addition-only and subtraction-only models. The first thing to notice is that the mini-batch error is inversely related to the training accuracy as the number of iterations grows. It can also be observed that for longer sequences the training error remains higher throughout the whole training session. This is due to the range of the regression output, as well as the difficulty of training on longer sequences. Interestingly, subtraction reaches the best performance in fewer iterations than addition, and the same happens with the model that computes both addition and subtraction.

Fig. 2. Effect of varying sequence length during training for the addition and subtraction model.

4.3 Testing

In general, all tested models are relatively successful, as shown in Table 2. Accuracy was computed by rounding the predictions and measuring the percentage of matches with the ground truth. For sequences of length 3 we reach accuracy above 97%, more than 90% for sequences of length 5, and between 75% and 85% for sequences of length 7. Longer sequences are more difficult to interpret, possibly due to the limited robustness of the regression network.
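A sketch of this evaluation criterion, assuming real-valued predictions and integer ground truths:

```python
import torch

def rounded_accuracy(predictions, targets):
    """Fraction of predictions that match the ground truth after rounding to
    the nearest integer (the evaluation criterion described above)."""
    return (predictions.round() == targets.round()).float().mean().item()

# Example: 2 of 3 predictions match after rounding -> accuracy ~0.667
acc = rounded_accuracy(torch.tensor([12.8, -4.2, 100.6]),
                       torch.tensor([13.0, -4.0, 100.0]))
```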

Table 2. Test results on the held-out dataset for the three models.

Among all three models, subtraction exhibits the best accuracy. The testing accuracy of the addition-subtraction model is slightly lower than that of the subtraction-only model, which can be attributed to the introduction of addition operations, whose symbol seems to be harder to identify than the subtraction symbol.

Table 3 depicts some sample test sequences. For short sequences, failure cases typically deviate from the ground truth by only a small amount, whereas for longer sequences the prediction can be off by a large margin. Although our goal was never to achieve 100% accuracy on the output prediction, but rather to study the problem of visual sequence analysis in the context of RNNs, the proposed system was still able to achieve good performance, motivating further studies in problems where long-term dependencies in visual sequential data need to be taken into account.

Table 3. Success and failure examples of test sequences.

5 Conclusions and Future Work

This work presented and evaluated an architecture to compute arithmetic operations from sequences of handwritten digits. The model is very successful at predicting the correct output for sequences with operands of one or two digits each. However, the model becomes harder to train for longer sequences, reaching a moderately successful result for operands with three digits each. The proposed architecture has also shown the effectiveness of RNN models for visual sequence analysis, in particular the capability of LSTMs to memorize and summarize sequential information. In future work, we aim to explore more efficient training strategies to better cover the combinatorial space of operations of a given length. We also plan to experiment with other operators in arbitrary positions of the sequence.