1 Introduction

Handwritten signature verification is important for person identification and document authentication, and it is increasingly being adopted in many civilian applications for enhanced security and privacy [14]. Among the various biometrics, the signature is a behavioral characteristic, i.e., one related to a person's pattern of behavior [9]. Compared with physiological characteristics and other behavioral characteristics, the handwritten signature has advantages in terms of accessibility and privacy protection. The use of handwritten signatures for person verification also has a long tradition, which gives it a special place among biometric traits. However, signature verification is difficult due to the variation of personal writing behavior and the high similarity between genuine signatures and skilled forgeries.

In the past decades, many signature verification methods have been proposed. Depending on the manner of data acquisition and recording, they can be divided into on-line signature verification [1, 4, 26] and off-line signature verification [9, 12, 19]. On-line signature verification relies on temporal stroke trajectory information acquired with special electronic devices, whereas off-line signature verification uses signature images obtained by scanning or camera capture. Off-line signature verification is more challenging because the temporal information of the strokes is not available. However, owing to the prevalence of handwritten documents, off-line signature verification is needed in many applications.

Most works on signature verification have focused on feature representation and similarity/distance metric evaluation, as in face verification, person re-identification, etc. [27]. For feature extraction, various descriptors have been presented. Gilperez et al. encoded directional properties of signature contours and the length of regions enclosed inside letters [8]. Guerbai et al. used the energy of curvelet coefficients computed from the signature image [9]. Kumar et al. designed a surroundedness feature capturing both shape and texture properties of the signature image [19]. Some methods learn the feature representation with convolutional neural networks [10, 27, 28]. According to the strategy used in the metric learning stage, methods can be grouped into writer-dependent and writer-independent ones. In the writer-dependent case, a specialized metric model is learned for each individual writer during the training phase, and the learned model is then used to classify a signature of that writer as genuine or forged. In the writer-independent case, a single metric model is shared by all writers (and tested on a separate set of users). Several types of classifiers have been used for metric learning, such as neural networks [19], Hidden Markov Models [21], Support Vector Machines [9, 19], and ensembles of these classifiers [12].

For off-line signature verification, most existing methods extract features from the whole signature image. However, the distinguishing characteristics of writing style are usually contained in writing details, such as individual strokes, which are very difficult to forge even for skilled forgers. On the other hand, some parts of a signature are relatively easy to copy. Therefore, extracting features from the whole image is not sufficient for high verification accuracy.

In this paper, we propose a novel framework for off-line signature verification in the writer-independent scenario. We use a Deep Convolutional Siamese Network for feature extraction and metric learning, and, to improve the verification performance, we extract features from local regions instead of the whole signature image. The similarity measures of multiple regions are fused for the final decision. The convolutional network is trained end-to-end on signature image pairs. We evaluated the verification performance of the proposed method on two benchmark datasets, CEDAR and GPDS, achieving 4.55% EER and 8.89% EER, respectively. These results are competitive with state-of-the-art approaches.

The rest of this paper is organized as follows: Sect. 2 gives a detailed introduction of the proposed method; Sect. 3 presents experimental results, and Sect. 4 offers concluding remarks.

2 Proposed Method

The diagram of the proposed method is given in Fig. 1. The system consists of preprocessing, local region segmentation, feature extraction and a metric model. The preprocessed signature image pairs are first segmented into a series of overlapping regions. The local region images are then fed into a Deep Convolutional Siamese Network to learn features, and the differences between the features of corresponding local regions of the input image pair are used to build the metric model. The similarity measures of multiple regions are finally fused for verification. During training, the parameters are adjusted so that the similarity between matched pairs is larger than that between mismatched pairs.

Fig. 1. The framework of the presented method. It consists of four parts: preprocessing, a region segmentation layer, a feature extractor and a metric model.

2.1 Preprocessing

Preprocessing plays an important role in off-line signature verification as with most pattern recognition problems. In real applications, signature images may present variations in terms of background, pen thickness, scale, rotation, etc., even among authentic signatures of the same user, as shown in Fig. 2.

Fig. 2. Signature samples from the CEDAR (column 1), GPDS (column 2) and ChnSig (column 3) databases. The first two rows show genuine samples; the third row shows skilled forgeries.

As Fig. 3 shows, we first convert the input signature image to a grayscale image. Many samples in the CEDAR database are rotated, so one additional preprocessing step is needed for CEDAR: the tilt correction method introduced by Kalera et al. [15] is employed to rectify the image. The samples in the CEDAR and ChnSig databases also have noisy backgrounds. We therefore employ Otsu's method [22] to binarize the signature image and obtain masks of the foreground and background. The background pixels are then reset to 255 according to the background mask. For the foreground, we normalize the grayscale distribution according to the foreground mask, in order to remove the influence of illumination and the various types of pens used by writers, as follows:

$$\begin{aligned} g'_f = \frac{(g_f-E(g_f))\cdot 10}{\delta (g_f)} + 30 \end{aligned}$$
(1)

where \(g_f\) and \(g'_f\) denote the original and normalized grayscale values respectively, and \(E(g_f)\) and \(\delta (g_f)\) denote the mean and standard deviation of the original grayscale values in the foreground. In this way, the mean and spread of the foreground grayscale are normalized to 30 and 10 in our experiments.
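As a concrete illustration, the following sketch (assuming OpenCV and NumPy, and interpreting \(\delta (g_f)\) as the standard deviation) shows how the Otsu masking and the foreground normalization of Eq. (1) could be implemented. It is an illustrative reconstruction, not the authors' released code.

```python
import cv2
import numpy as np

def normalize_foreground(gray):
    """Sketch of the preprocessing in Eq. (1): Otsu binarization separates
    strokes from background, background pixels are reset to 255, and the
    foreground grayscale is normalized to mean 30 and spread 10."""
    # Otsu's threshold: darker pixels (below the threshold) are treated as strokes
    thresh, _ = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    fg_mask = gray < thresh

    out = np.full_like(gray, 255)                  # reset background to 255
    fg = gray[fg_mask].astype(np.float64)
    if fg.size > 0:
        # Eq. (1): g' = (g - E(g)) * 10 / delta(g) + 30
        normalized = (fg - fg.mean()) * 10.0 / (fg.std() + 1e-8) + 30.0
        out[fg_mask] = np.clip(normalized, 0, 255).astype(gray.dtype)
    return out
```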

Fig. 3. The preprocessing strategy of the proposed method.

The signature images may have different resolutions or sizes, and the locations of the signature strokes may vary across images. In order to roughly align the stroke locations across different images, we employ the moment normalization method [20] to normalize the sizes and locations of the signatures. Let \(f(x,y)\) denote the pixel of the original image at location \((x,y)\), and \(f'(x',y')\) the pixel of the normalized image at location \((x',y')\). Then \(f(x,y)\) is mapped to \(f'(x',y')\) as follows:

$$\begin{aligned} x = (x'-x'_c)/\alpha +x_c \end{aligned}$$
(2)
$$\begin{aligned} y = (y'-y'_c)/\alpha +y_c \end{aligned}$$
(3)

where \(x'_c\) and \(y'_c\) denote the center of the normalized signature, \(x_c\) and \(y_c\) denote the center of the original signature, and \(\alpha \) is the ratio of the normalized signature size to the original signature size, which can be estimated from the central moments of the inverted image (signature strokes in gray, background in black) as follows:

$$\begin{aligned} \alpha = 0.6 \cdot min(\frac{H_{norm}\sqrt{\mu _{00}}}{2\sqrt{2\mu _{02}}},\frac{W_{norm}\sqrt{\mu _{00}}}{2\sqrt{2\mu _{20}}}) \end{aligned}$$
(4)

where \(H_{norm}\) and \(W_{norm}\) denote the height and width of the normalized image, and \(\mu _{pq}\) denotes the central moments:

$$\begin{aligned} \mu _{pq} = \sum _{x} \sum _{y} (x-x_c)^p (y-y_c)^q \left[ 255-f(x,y)\right] \end{aligned}$$
(5)

We set \(H_{norm}\) and \(W_{norm}\) to 224 and 512 in the experiments, i.e., the signature images are normalized to \(512\times 224\) for the subsequent feature extraction.
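The moment normalization of Eqs. (2)-(5) can be realized by inverse mapping each output pixel to the original image, as in the following sketch (NumPy only; an illustrative reconstruction under the stated equations, not the authors' code):

```python
import numpy as np

def moment_normalize(gray, h_norm=224, w_norm=512):
    """Sketch of Eqs. (2)-(5): center and rescale the signature into a
    h_norm x w_norm image using the central moments of the inverted image."""
    inv = 255.0 - gray.astype(np.float64)          # strokes bright, background zero
    ys, xs = np.mgrid[0:gray.shape[0], 0:gray.shape[1]]

    m00 = inv.sum() + 1e-8                          # mu_00 in Eq. (5)
    xc = (xs * inv).sum() / m00                     # centroid of the inverted image
    yc = (ys * inv).sum() / m00
    mu20 = ((xs - xc) ** 2 * inv).sum()             # central moments, Eq. (5)
    mu02 = ((ys - yc) ** 2 * inv).sum()

    # Eq. (4): scale ratio from the stroke spread in each direction
    alpha = 0.6 * min(h_norm * np.sqrt(m00) / (2 * np.sqrt(2 * mu02)),
                      w_norm * np.sqrt(m00) / (2 * np.sqrt(2 * mu20)))

    out = np.full((h_norm, w_norm), 255, dtype=gray.dtype)
    yp, xp = np.mgrid[0:h_norm, 0:w_norm]
    # Eqs. (2)-(3): map output coordinates back to the original image
    x_src = np.round((xp - w_norm / 2) / alpha + xc).astype(int)
    y_src = np.round((yp - h_norm / 2) / alpha + yc).astype(int)
    valid = ((x_src >= 0) & (x_src < gray.shape[1]) &
             (y_src >= 0) & (y_src < gray.shape[0]))
    out[valid] = gray[y_src[valid], x_src[valid]]
    return out
```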

2.2 Feature Extraction

We use a Deep Convolutional Siamese Network, composed of two convolutional neural network (CNN) branches sharing the same parameters, to learn the feature representation of the local regions of the signature images. There are many popular CNN architectures, such as AlexNet [17], VGG [23], ResNet [11] and DenseNet [13]. Through experimental comparison, we chose a DenseNet-36 to constitute the Deep Convolutional Siamese Network for feature extraction. The structure of DenseNet-36 is shown in Table 1.

Table 1. The structure of DenseNet-36

In particular, we feed the inverted image into DenseNet-36, and we do not use dropout but add batch normalization to every convolution layer. The number of channels of the first convolution layer is set to \(N_{init}=64\), and the growth rate to \(k=32\), as described by Huang et al. [13]. We test two cases: in the first, denoted 'whole', the input is the whole signature image; in the second, denoted 'region', the input is a local region of the signature image. The two cases therefore have different output sizes, as described in Table 1. In both cases, we flatten the feature maps of the last DenseBlock into a feature vector, which has \(244\times 16\times 7=27328\) dimensions for 'whole' and \(244\times 7\times 7=11956\) dimensions for 'region'.
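A minimal sketch of the shared-weight Siamese branch is given below. It approximates DenseNet-36 with torchvision's DenseNet class configured with blocks (3, 4, 6, 3), \(N_{init}=64\) and \(k=32\); the exact layer configuration used in the paper may differ.

```python
import torch
import torch.nn as nn
from torchvision.models import DenseNet

class SiameseFeatureExtractor(nn.Module):
    """Two CNN branches with fully shared parameters (a single backbone
    applied to both inputs), producing a flattened feature vector each."""
    def __init__(self):
        super().__init__()
        backbone = DenseNet(growth_rate=32, block_config=(3, 4, 6, 3),
                            num_init_features=64)
        self.features = backbone.features          # drop the classifier head

    def forward(self, img1, img2):
        # The same weights process both images of the pair
        f1 = torch.flatten(self.features(img1), start_dim=1)
        f2 = torch.flatten(self.features(img2), start_dim=1)
        return f1, f2                               # 11956-dim for 224x224 regions
```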

2.3 Metric Model

After feeding two signature images, or corresponding local regions of them, into the Deep Convolutional Siamese Network, we obtain a pair of feature vectors \(F_1,F_2\in \mathfrak {R}^d\), where d is the feature dimension. The difference between the two feature vectors serves as the basis of the similarity measure. We tried the cosine distance, the Euclidean distance, and the element-wise absolute difference of the feature vector pair, and found that the absolute difference, denoted as \(F= \left| F_1-F_2 \right| \), performs best. A linear layer then projects the feature vector F to a 2-dimensional output \((\hat{p_1},\hat{p_2})^T\), where \(\hat{p_1}\) represents the predicted probability that the two signatures belong to the same user, and \(\hat{p_2}\) the predicted probability of the opposite case (\(\hat{p_1}+\hat{p_2}=1\)). In this way, signature verification can be treated as a binary classification problem, and the cross-entropy loss is used as the objective function to optimize our model:

$$\begin{aligned} Loss(p,\hat{p})=-\left[ p\cdot ln(\hat{p_1})+(1-p)\cdot ln(\hat{p_2}) \right] = \sum _{i=1}^2 -p_i\cdot ln(\hat{p_i}) \end{aligned}$$
(6)

where p is the target class (same or different) and \(\hat{p}\) is the predicted probability. If the two signatures are written by the same user, \(p_1=1\) and \(p_2=0\); otherwise \(p_1=0\) and \(p_2=1\). We then use \(\hat{p_1}\) as the similarity measure of the two signatures.
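The metric model can be sketched as follows in PyTorch; the feature dimension and class names are illustrative (the 11956-dim 'region' case), and the softmax output corresponds to \(\hat{p_1}\):

```python
import torch
import torch.nn as nn

class MetricHead(nn.Module):
    """Sketch of the metric model: element-wise absolute difference of the
    feature vectors, a linear projection to two logits, and cross-entropy
    training as in Eq. (6)."""
    def __init__(self, feat_dim=11956):
        super().__init__()
        self.dropout = nn.Dropout(p=0.3)           # dropout on the linear layer
        self.fc = nn.Linear(feat_dim, 2)

    def forward(self, f1, f2):
        diff = torch.abs(f1 - f2)                  # F = |F1 - F2|
        return self.fc(self.dropout(diff))         # two logits

    def similarity(self, f1, f2):
        # \hat{p}_1: predicted probability that both signatures come from
        # the same writer, used as the similarity measure at test time
        return torch.softmax(self.forward(f1, f2), dim=1)[:, 0]

# Training uses nn.CrossEntropyLoss() on the two logits,
# with target 0 for matched pairs and 1 for mismatched pairs.
```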

2.4 Region Based Metric Learning

As described before, in order to improve the verification accuracy, we extract features from local regions instead of the whole signature image. The local regions are obtained by a sliding window of size \(224\times 224\) scanning across the input signature image with a step of 36 pixels. Among the resulting 9 overlapping local regions of a \(512\times 224\) signature image, the first and the last are discarded since they contain little useful information for verification. The remaining 7 regions are fed to the Deep Convolutional Siamese Network for feature extraction and metric learning. Specifically, the differences between corresponding regions of the two input signature images yield per-region similarity measures, which are fused by averaging for the final decision. All 7 regions are used to optimize the metric model in the training stage, while only a subset of regions is used in the testing stage for verification.
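The region segmentation and the averaging fusion can be sketched as follows; `extractor` and `head` refer to the illustrative `SiameseFeatureExtractor` and `MetricHead` classes sketched above, and region indices are 0-based:

```python
import torch

def extract_regions(img, win=224, step=36, keep=range(1, 8)):
    """Slide a 224x224 window with a 36-pixel step over a (C, 224, 512)
    tensor, yielding 9 overlapping regions; drop the first and last."""
    regions = [img[:, :, x:x + win]
               for x in range(0, img.shape[-1] - win + 1, step)]
    return [regions[i] for i in keep]               # 7 regions remain

def fused_similarity(extractor, head, img1, img2, region_ids=(0, 3, 6)):
    """Average the per-region similarity measures, e.g. of the 1st, 4th and
    7th of the 7 kept regions, for the final decision."""
    r1, r2 = extract_regions(img1), extract_regions(img2)
    sims = []
    for i in region_ids:
        f1, f2 = extractor(r1[i].unsqueeze(0), r2[i].unsqueeze(0))
        sims.append(head.similarity(f1, f2))
    return torch.stack(sims).mean()
```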

3 Experiments

Three metrics are used for evaluating an off-line signature verification system: the False Rejection Rate (FRR), the False Acceptance Rate for skilled forgeries (\(\text {FAR}_{\text {skilled}}\); since we only consider skilled forgeries in this paper, we simply write FAR), and the Equal Error Rate (EER). The first is the rate of false rejections of genuine signatures, the second is the rate of false acceptances of forged signatures, and the last is determined by ROC analysis [5] as the operating point where FAR equals FRR.
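For clarity, a minimal sketch of how FAR, FRR and EER can be computed from similarity scores is given below (a threshold sweep; not part of the evaluation protocol defined in [5], just an illustration):

```python
import numpy as np

def far_frr_eer(genuine_scores, forgery_scores):
    """Sweep a decision threshold over the scores and report the point
    where FAR and FRR are (approximately) equal."""
    thresholds = np.sort(np.concatenate([genuine_scores, forgery_scores]))
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])   # rejected genuines
    far = np.array([(forgery_scores >= t).mean() for t in thresholds])  # accepted forgeries
    idx = np.argmin(np.abs(far - frr))
    return far[idx], frr[idx], (far[idx] + frr[idx]) / 2.0
```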

3.1 Datasets and Implementation Details

Three databases are used for evaluation: two popular benchmarks, the CEDAR [15] and GPDS [7] databases, and a Chinese handwritten signature database named ChnSig that we established ourselves.

The CEDAR database is an off-line signature database created with data from 55 users; randomly chosen users were asked to create forgeries of signatures from other writers. Each user has 24 genuine signatures and 24 skilled forgeries, so we obtain \(C_{24}^2=276\) genuine-genuine pairs as positive samples and \(C_{24}^1\times C_{24}^1=576\) genuine-forged pairs as negative samples per user. We randomly selected 50 users as the training set and the remaining 5 users as the testing set, giving 42600 training samples and 4260 testing samples in total.

The GPDS database is an off-line signature database created with data from 4000 users. Each user has 24 genuine signatures and 30 skilled forgeries, so we obtain \(C_{24}^2=276\) positive samples and \(C_{24}^1\times C_{30}^1=720\) negative samples per user. We randomly selected 2000 users as the training set and the remaining 2000 users as the testing set, giving 1992000 training samples and 1992000 testing samples in total.

The ChnSig database is a Chinese off-line signature database that we created ourselves with data from 1243 users; randomly chosen users were asked to create forgeries of signatures from other writers. Each user has 10 genuine signatures and 16 skilled forgeries, so we obtain \(C_{10}^2=45\) positive samples and \(C_{10}^1\times C_{16}^1=160\) negative samples per user. We randomly selected 1000 users as the training set and the remaining 243 users as the testing set, giving 205000 training samples and 49815 testing samples in total.
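The per-user pair construction used above can be summarized by a short sketch (hypothetical helper; `genuine` and `forged` are lists of preprocessed signature images of one user):

```python
from itertools import combinations, product

def build_pairs(genuine, forged):
    """Enumerate all genuine-genuine combinations as positives and all
    genuine-forgery products as negatives, e.g. C(24,2)=276 positives and
    24*24=576 negatives per user on CEDAR."""
    positives = [(a, b, 1) for a, b in combinations(genuine, 2)]
    negatives = [(g, f, 0) for g, f in product(genuine, forged)]
    return positives + negatives
```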

We implemented our model on the PyTorch platform and trained it using Adam [16] with a learning rate of 0.001. We used mini-batches of 64 pairs of signature regions (32 pairs for whole images). Dropout with a rate of 0.3 was applied to the linear layer. For the CEDAR and ChnSig databases, the model trained on the GPDS database was used as initialization and fine-tuned. Experiments were performed on a workstation with an Intel(R) Xeon(R) E5-2680 CPU, 256 GB RAM and an NVIDIA GeForce GTX TITAN X GPU. The system takes only 10 ms on average to verify a pair of signatures.
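The optimization setup can be summarized by the following sketch; the module names (`SiameseFeatureExtractor`, `MetricHead`) are the illustrative classes sketched in Sects. 2.2 and 2.3, not the authors' released code:

```python
import torch

extractor, head = SiameseFeatureExtractor(), MetricHead()
params = list(extractor.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)      # Adam, learning rate 0.001
criterion = torch.nn.CrossEntropyLoss()             # cross-entropy loss of Eq. (6)

def train_step(img1, img2, targets):
    """One step on a mini-batch of region pairs (targets: 0 = same writer)."""
    optimizer.zero_grad()
    f1, f2 = extractor(img1, img2)
    loss = criterion(head(f1, f2), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```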

3.2 CNN Architectures for Feature Extraction

To determine the architecture of the feature extractor, we fed whole signature images into different CNN architectures for feature extraction and measured the similarity of signature pairs on the GPDS database.

Table 2. Effects of different CNN architectures on performance (%) on GPDS database
Table 3. Effects of hyperparameter selection of DenseNet on performance (%) on GPDS database

As Table 2 shows, the DenseNet-36 architecture achieves the best performance with the smallest model size, so we adopted it as the feature extractor. To further improve performance, we designed experiments for the hyperparameter selection of DenseNet, varying the DenseBlock configuration, \(N_{init}\) and the growth rate k as proposed by Huang et al. [13]. As Table 3 shows, the performance is best when the DenseBlocks are set to (3, 4, 6, 3) with \(N_{init}=64\) and \(k=32\).

3.3 Region Fusion

After determining the structure of the feature extractor, we trained our model on regions of signature images as described in Sect. 2.4. We first number the regions from left to right as 1 to 7 and test our model on each of them; the results are shown in Table 4, where {i} means that the model is tested on the i-th region. The model achieves the best performance on the 4th region. The reason is that preprocessing normalizes the location of the signature to the center of the image, and the 4th region lies exactly at this center, so the regions around the image center contain more information about the signature strokes. Therefore, we choose regions around the 4th region to test the model in the region fusion case.

Table 4. Performance of system in different regions (EER %)
Table 5. Performance of system in different region fusion cases (EER %)

In the region fusion case, we take different groups of regions that are symmetric about the center of the signature image and fuse their similarity measures. In Table 5, 'Whole' means that the whole signature image is fed into the model, and {i, j, ...} means that the similarity measures of the i-th, j-th, etc. regions are fused. As Table 5 shows, feeding only the 4th region already achieves better performance than feeding the whole signature image. The reason is that it is difficult to extract detailed features from the whole signature image, so the metric model is easily misled by areas where the signature strokes are similar. The system achieves the best performance when the similarity measures of the 1st, 4th and 7th regions are fused. In addition, the proposed method also performs well on a Chinese corpus, achieving 9.91% EER on the ChnSig database.

3.4 Comparative Evaluation

We choose the combination of parameters that achieves the best performance in the above discussion and evaluate our model on two public benchmarks of off-line signature verification. The results are listed in Tables 6 and 7 together with state-of-the-art methods. It should be mentioned that some methods report the Average Error Rate (AER), i.e., the average of FAR and FRR, instead of EER. Since the difference between EER and AER is small, we treat them as equivalent for comparison.

Table 6. Comparison between proposed and other published methods on CEDAR database (%)
Table 7. Comparison between proposed and other published methods on GPDS database (%)

From Tables 6 and 7 we can see that the proposed system outperforms all compared methods on the CEDAR and GPDS databases. The systems of Chen et al. [2] and Chen et al. [3] reported in Table 6 are writer-dependent and have to be updated whenever a new writer is added, whereas the proposed system can handle any newly added writer without re-training. The other systems reported in Table 7 (except Soleimani et al. [24] and Xing et al. [27]) are tested on the GPDS database with different numbers of users. It is all the more persuasive that our system is tested on the largest subset of the database and still achieves state-of-the-art performance compared with the other systems.

4 Conclusion and Future Work

In this paper, we propose a novel framework for off-line signature verification using a Deep Convolutional Siamese Network for metric learning. To improve the discrimination ability, we extract features from local regions instead of the whole signature image and fuse the similarity measures of multiple regions for verification. The feature extractors of the different regions share the convolutional layers of the network, which is trained on signature image pairs. In experiments on the benchmark datasets CEDAR and GPDS, the proposed method achieved 4.55% EER and 8.89% EER, respectively, which are competitive with state-of-the-art approaches. The method can be further improved by refining the metric model and by evaluating it on more challenging datasets.