1 Introduction

Iris recognition [1,2,3,4] is one of the most accurate and widely adopted approaches to automated personal identification and has become increasingly integrated into our daily life. The performance of iris recognition systems relies heavily on the effectiveness of iris segmentation. Iris detection and segmentation [1,2,3,4,5,6,7,8] aims at locating and isolating valid iris texture regions from eye images. The accuracy of iris segmentation greatly impacts subsequent procedures such as image normalization, feature extraction, and pattern matching in iris recognition. However, traditional iris segmentation methods require subjects to be captured under strictly constrained conditions, and their performance decreases dramatically in non-cooperative environments such as remote iris recognition systems and on-the-move systems. The degradation is mainly caused by occlusions, specular reflections, blur and off-axis gaze. This is one of the major difficulties preventing iris recognition systems from being deployed more widely in civilian and surveillance applications.

For a friendly user experience, various iris segmentation methods have been proposed to improve performance, especially in non-cooperative environments, and have achieved remarkable results. Existing methods can be divided into two categories: boundary based methods [2,3,4,5,6,7,9,10] and pixel based methods [11,12,13]. Boundary based methods segment iris texture regions by locating the pupillary, limbic and eyelid boundaries, while pixel based methods directly distinguish iris pixels from non-iris pixels according to the surrounding appearance features. However, boundary based methods are easily and severely influenced by noisy data around iris boundaries, especially in low-quality images, and prior knowledge such as the shapes of boundaries is usually required as a constraint. Although pixel based methods do not require any assumption about boundary shapes, handcrafted appearance features are still not discriminative enough to separate iris regions from non-iris regions. Due to these shortcomings, existing methods fall short of enabling a more robust and accurate iris recognition system.

To address the problems mentioned above, and inspired by the way human beings recognize an iris, an efficient and accurate iris detection and segmentation network based on multi-scale optimized Mask R-CNN (IDSN-MS) is introduced. It is an efficient and flexible model that detects the iris in a captured image while simultaneously generating a high-quality segmentation result for it. To reduce the influence of noisy data on boundaries and extract more distinctive features, the proposed method utilizes deep neural networks that segment the iris texture region directly according to learnt features. Moreover, to reduce the time consumption of deep neural networks, it first detects the iris region and then segments the iris within the detected region. As detection is much faster than segmentation, segmenting only the small detected region is a wise choice in terms of time cost. It also makes the network concentrate on the iris region rather than the whole image, so that more network parameters are devoted to learning detailed features from the Regions of Interest (RoIs) instead of rough boundaries. This benefits both accuracy and efficiency, as the segmentation branch can pay close attention to the actual iris region. Furthermore, a multi-scale fusion branch is introduced to fully utilize boundary information of feature maps at various scale levels, which helps faithfully preserve information derived from local texture and global structure simultaneously.

The rest of the paper is organized as follows. Section 2 reviews related work on iris segmentation. The proposed method is introduced in Sect. 3, and experimental results are presented in Sect. 4. Finally, the conclusion is drawn in Sect. 5.

2 Related Work

2.1 Boundary Based Methods

Boundary based iris segmentation methods were proposed as early as the earliest research on iris recognition. The integrodifferential operator [5] proposed by Daugman is one of the classic iris segmentation methods: it is an edge detector that searches for the best fitting circular boundary over the whole image. The Hough transform [9] is another iris segmentation method based on the assumption of circular boundaries; after deriving a binary edge map, it finds the set of boundary parameters that most of the edge pixels in the map vote for. Similar methods [2,3,4] have also been proposed. An elastic model called the pulling and pushing method [7], inspired by Hooke's law, treats iris boundaries as circles as well. However, further research indicates that circles are not always able to fit iris boundaries precisely, since iris boundaries in captured images are often non-circular, and badly fitted boundaries are one cause of false rejections in matching. To address this problem, the active contour model was introduced to iris segmentation [6] in Daugman's follow-up work. Shah and Ross [10] further improved this kind of method by utilizing geodesic active contours and achieved accurate iris segmentation results. All these methods achieve remarkable iris segmentation performance, but the shapes of the iris regions they segment are severely influenced by noisy data, especially in low-quality images or non-cooperative environments, resulting in failures of iris recognition.

Some iris detection methods [14, 15] are also based on the iris boundary. These methods determine whether a candidate pre-detected region is an iris region or not. Similar to boundary based iris segmentation methods, boundary based iris detection methods are also easily influenced by noisy data.

2.2 Pixel Based Methods

Different from boundary based methods that utilize gradient and geometry information, pixel based methods mainly focus on distinguishing pixels of the iris region from those of non-iris regions according to appearance features, such as the texture around each pixel. Various handcrafted features such as Gabor filters [11], Zernike moments [12], and location and color features [13] are employed as appearance features, and classic classifiers such as support vector machines (SVMs), graph cut methods and Gaussian mixture models (GMMs) are trained to distinguish iris region features from non-iris region features. In this way, noisy boundaries can be removed. These methods accurately segment and isolate iris texture regions from well-captured images, without any assumption about boundary shapes. However, handcrafted appearance features are still not discriminative enough to separate iris regions from non-iris regions.

2.3 Deep Neural Network Based Methods

Nowadays, deep neural networks have attracted increasing attention, as they achieve remarkable results in object instance segmentation tasks. A variety of iris segmentation works based on convolutional neural networks (CNNs) have been proposed recently. The fully convolutional deep neural network (FCDNN) [16], multi-scale fully convolutional networks (MFCNs) [8] and the Seg-Edge bilateral constraint network (SEN) [17] were proposed specifically for the iris segmentation task. MFCN [8], an end-to-end iris segmentation model, is an extension of FCNs. It fuses multiple layers, from shallow-and-fine to deep-and-coarse, which balances information from local texture and global structure. Their superior performance illustrates the outstanding effectiveness of deep neural networks (DNNs) on iris segmentation. However, this remarkable performance relies heavily on a large number of model parameters, which makes these models time consuming and prevents them from being deployed in real-time iris recognition systems.

3 Proposed Method

In this section, the proposed iris detection and segmentation method is described in detail. It first detects the actual iris location and crops an RoI based on the detection result. Iris segmentation is then performed on the RoI, which is much faster than segmentation on the whole image. The proposed model is an extension of Mask R-CNN and achieves more accurate iris segmentation results with the help of multi-scale fusion. Its superior performance suggests that the proposed method is not only an accurate but also an efficient iris detection and segmentation method, which makes more accurate iris segmentation available in real-time iris recognition systems.

The Mask R-CNN [18] approach is an extension of Faster R-CNN [19]. It accomplishes object detection, classification and object instance segmentation simultaneously in a single network architecture. For object instance segmentation, it predicts a segmentation mask on each RoI in a pixel-to-pixel manner by adding a mask branch. The mask branch is in parallel with the classification branch, which prevents it from being influenced by the classification task and lets it focus fully on the segmentation task to faithfully preserve the explicit spatial position of each instance. In addition, the mask branch is a small FCN that does not consume much time, which makes the approach suitable for real-time segmentation systems.

3.1 Backbone Structure

ResNet-50 [20] is used as the basic network here, owing to its efficiency and the expressive feature extractors benefiting from its deep yet relatively lightweight architecture. A feature pyramid network (FPN) [21] is then adopted to enhance the expressive ability of the features.
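As a point of reference, the snippet below sketches how such a ResNet-50 + FPN backbone could be assembled with torchvision's detection utilities. It is an illustrative sketch only; argument names such as `pretrained` differ between torchvision versions (newer releases use `weights`), and it should not be read as the exact configuration used in this work.

```python
# Illustrative sketch of a ResNet-50 + FPN backbone using torchvision; this is an
# assumption about tooling, not the paper's exact configuration.
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

backbone = resnet_fpn_backbone('resnet50', pretrained=False)

image = torch.randn(1, 3, 360, 480)   # a 480x360 input, as used in Sect. 4.2
features = backbone(image)            # OrderedDict of multi-scale FPN feature maps
for level, fmap in features.items():
    print(level, tuple(fmap.shape))   # 256-channel maps at decreasing resolutions
```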

Fig. 1. Network architecture of the proposed method. The iris image is fed into the ResNet-FPN backbone for feature extraction. The RPN and R-CNN head are standard components of Mask R-CNN. The attention module obtains the feature map to be segmented, and the multi-scale fusion module fuses feature maps with different scales.

The overall architecture of the proposed method is shown in Fig. 1. Features of input images are extracted by ResNet-50, from which rough candidate feature maps are obtained through the Region Proposal Network (RPN). The generated feature maps are used for both iris detection and iris segmentation. After the attention module [22], the iris bounding box is generated, along with the extracted feature map. The multi-scale fusion module then expands the feature map into various scales to fully exploit iris boundary information, and the final iris segmentation mask is obtained afterward.

3.2 Attention Module

In terms of iris segmentation, most incorrect segmentation appears on the eyelids or glasses frames. Paying more attention to iris regions is therefore effective in reducing incorrect segmentation. Given the limited representation ability of the network, a more accurately localized area to be segmented is more conducive to network learning.

Therefore, the detection network is used to obtain the bounding box of the iris, and correct bounding boxes are retained as pre-selection boxes for masks. On the corresponding feature map, RoIAlign is used to extract feature maps of \(14\times 14\), \(28\times 28\) and \(56\times 56\) pixels according to the pre-selected boxes. In this way, the mask network obtains features at different scales and is able to focus on segmentation much better. Meanwhile, the iris mask is extracted on the RoI feature map instead of the entire feature map, which reduces computation significantly while improving accuracy.
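As a rough sketch of this step, torchvision's RoIAlign can pool a single detected box into several fixed sizes; the feature map and box coordinates below are hypothetical and only serve to show the mechanics.

```python
# Sketch of pooling one detected iris box into 14x14, 28x28 and 56x56 RoI features
# with RoIAlign; the feature map and box coordinates are placeholders.
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 56, 56)               # one FPN level for one image
# Box format: (batch_index, x1, y1, x2, y2) in feature-map coordinates.
iris_box = torch.tensor([[0.0, 10.0, 12.0, 46.0, 50.0]])

rois = {s: roi_align(feature_map, iris_box, output_size=(s, s), spatial_scale=1.0)
        for s in (14, 28, 56)}
print({s: tuple(r.shape) for s, r in rois.items()})
# {14: (1, 256, 14, 14), 28: (1, 256, 28, 28), 56: (1, 256, 56, 56)}
```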

3.3 Multi-scale Fusion Module

Feature maps with different sizes represent iris boundary information from different aspects. A small feature map exhibits the rough contour of the iris region and has great resistance to noisy boundaries. As the size of the feature map increases, details of the iris boundaries appear more clearly, together with incorrect noisy segmentation results. To further take advantage of iris region information at different scale levels, feature maps with different scales are fused. The detailed fusion procedure is shown in Fig. 2.

Fig. 2. Flowchart of multi-scale fusion.

Specifically, feature maps resized to \(14\times 14\), \(28\times 28\) and \(56\times 56\) pixels are derived from the attention module. Convolution operations are performed on the three feature maps to obtain segmentation features of different scales. After upsampling by bilinear interpolation, the feature maps of all scales are interpolated to \(56\times 56\) pixels. To balance the robustness and precision of the different scales, the feature maps are fused into a single feature map by a weighted sum rule. The fused feature map then passes through four convolution operations and one deconvolution operation. Finally, the fused feature map is resized to \(112\times 112\) pixels for iris segmentation. The fusion operation can be expressed as:

$$\begin{aligned} F(x,y)=\delta _{1}*V_{1}(x,y)+\delta _{2}* V_{2}(x,y)+\delta _{3}* V_{3}(x,y)\quad \quad (x,y) \in M \end{aligned}$$
(1)

where F(x, y) represents the fused features at position (x, y). \(V_{1}(x,y)\), \(V_{2}(x,y)\) and \(V_{3}(x,y)\) represent the features of the three different scales, sampled from the corresponding feature maps at position (x, y). \(\delta _{1}\), \(\delta _{2}\) and \(\delta _{3}\) are the weights of the feature maps, and M is the set of positions in the feature map.
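A minimal PyTorch sketch of this fusion is given below. The channel width, kernel sizes, and the final \(1\times 1\) mask predictor are assumptions made for illustration, not the exact layers of the proposed network.

```python
# Minimal sketch of the multi-scale fusion of Eq. (1): three RoI features are
# upsampled to 56x56, combined as a weighted sum, then refined and upsampled to
# 112x112. Layer widths and kernel sizes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    def __init__(self, channels=256, weights=(1.0, 1.0, 1.0)):
        super().__init__()
        self.weights = weights
        self.scale_convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3)])
        self.refine = nn.Sequential(
            *[nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
              for _ in range(4)],                           # four convolutions
            nn.ConvTranspose2d(channels, channels, 2, stride=2), nn.ReLU())
        self.mask_logits = nn.Conv2d(channels, 1, 1)         # per-pixel iris logit

    def forward(self, v14, v28, v56):
        feats = []
        for conv, v in zip(self.scale_convs, (v14, v28, v56)):
            f = conv(v)
            feats.append(F.interpolate(f, size=(56, 56), mode='bilinear',
                                       align_corners=False))
        fused = sum(w * f for w, f in zip(self.weights, feats))   # Eq. (1)
        fused = self.refine(fused)                                 # 56x56 -> 112x112
        return self.mask_logits(fused)

# Usage with hypothetical RoI features from the attention module:
v14, v28, v56 = (torch.randn(1, 256, s, s) for s in (14, 28, 56))
logits = MultiScaleFusion()(v14, v28, v56)
print(logits.shape)   # torch.Size([1, 1, 112, 112])
```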

The output probability is obtained with the sigmoid function, and the cross-entropy function is adopted as the loss. The cross-entropy loss is formulated as follows:

$$\begin{aligned} L=-\sum _{i=1}^{N}\left[ y_{i}\log \hat{y}_{i}+(1-y_{i})\log (1-\hat{y}_{i})\right] \end{aligned}$$
(2)
$$\begin{aligned} \hat{y}_{i}=g(F(x,y))=g(\delta _{1}*V_{1}(x,y)+\delta _{2}* V_{2}(x,y)+\delta _{3}* V_{3}(x,y)) \end{aligned}$$

where \(\hat{y}_{i}\) represents the predicted probability that pixel i belongs to the iris, \(y_{i}\) is the corresponding ground-truth label, and \(g(\cdot )\) is the sigmoid function. The prediction for each pixel is obtained from the fusion of the corresponding features at different scales. N represents the total number of pixels to be segmented.
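For concreteness, the per-pixel loss could be computed as in the following sketch, where `target` is a hypothetical \(112\times 112\) binary ground-truth mask and averaging over pixels is an assumption.

```python
# Sketch of the sigmoid + cross-entropy loss of Eq. (2); `target` is a placeholder
# binary ground-truth iris mask.
import torch
import torch.nn.functional as F

logits = torch.randn(1, 1, 112, 112)                  # output of the fusion module
target = (torch.rand(1, 1, 112, 112) > 0.5).float()   # hypothetical ground truth

prob = torch.sigmoid(logits)                          # \hat{y}_i = g(F(x, y))
loss = F.binary_cross_entropy(prob, target)           # cross entropy over all pixels
print(loss.item())
```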

4 Experiment

4.1 Datasets

Two public datasets, UBIRIS.v2 [23] and CASIA.v4-Distance [24], are used to evaluate the segmentation accuracy and efficiency of the proposed method. UBIRIS.v2 contains 945 visible-wavelength images captured under non-constrained conditions with various capture distances and illuminations; 500 images are used for training and 445 for testing. Images in CASIA.v4-Distance are captured by a near-infrared device, and the dataset is separated into two parts, 300 images for training and 100 images for testing. Iris regions in all of the above images are labeled with masks and bounding boxes.

4.2 Datasets Augmentation

To enhance the robustness and generalization ability of the proposed method, the training datasets are augmented by adding additional illumination, shadows and blur to the images. Furthermore, images are randomly resized by a factor of 1 to 2 to simulate iris images captured under non-constrained conditions. All images are cropped to \(480\times 360\) pixels as the standard input. The total number of training images for UBIRIS.v2 and CASIA.v4-Distance is 5000 and 15000 respectively.
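A possible augmentation pipeline in this spirit is sketched below with torchvision transforms (a reasonably recent torchvision is assumed); the jitter and blur parameters are illustrative choices, not the ones used to build the training sets.

```python
# Illustrative augmentation sketch: illumination jitter, blur, random 1x-2x rescale
# and a 480x360 crop; all parameter ranges are assumptions.
import random
from PIL import Image
from torchvision import transforms

def augment(img: Image.Image) -> Image.Image:
    scale = random.uniform(1.0, 2.0)                            # random 1x-2x resize
    img = img.resize((int(img.width * scale), int(img.height * scale)))
    pipeline = transforms.Compose([
        transforms.ColorJitter(brightness=0.4, contrast=0.3),   # illumination change
        transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
        transforms.RandomCrop((360, 480), pad_if_needed=True),  # 480x360 input size
    ])
    return pipeline(img)
```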

4.3 Training Pipeline

Mask R-CNN is employed as the backbone of the iris segmentation task. Considering time consumption, ResNet-50 is used to learn a qualified feature extractor, owing to its superior efficiency and feature discrimination. The multi-scale fusion module is then adopted to facilitate precise spatial localization of iris regions.

Training: The backbone network is initialized with ImageNet pretrained parameters. The training schedule is 160k iterations, in which the learning rate starts at 0.0025 and is reduced by a factor of 10 at 60k iterations. In the Region Proposal Network (RPN) module, the number of output bounding boxes is fixed to 2,000. Bounding boxes whose Intersection over Union (IoU) is larger than 0.7 are treated as positive samples. The mask prediction threshold is 0.5.

Testing: The number of output bounding boxes in the RPN is set to 1,000, and the detection threshold is set to 0.7.
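For reference, these hyperparameters can be summarized as follows; the key names are illustrative only and do not correspond to any particular framework's configuration format.

```python
# Training/testing hyperparameters from Sect. 4.3, gathered in one place.
train_cfg = {
    'backbone_init': 'ImageNet-pretrained ResNet-50',
    'max_iterations': 160_000,
    'base_lr': 0.0025,
    'lr_decay_at': 60_000,          # learning rate divided by 10 here
    'rpn_output_boxes': 2_000,
    'rpn_positive_iou': 0.7,
    'mask_threshold': 0.5,
}
test_cfg = {
    'rpn_output_boxes': 1_000,
    'detection_threshold': 0.7,
}
```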

4.4 Experimental Results

Average segmentation error (ASE), a widely used indicator, is adopted to evaluate the accuracy of the proposed method. It is computed as follows:

$$\begin{aligned} ASE=\frac{1}{N\times H\times W}\sum _{n=1}^{N}\sum _{(i,j)} G_{n}(i,j)\oplus M_{n}(i,j) \end{aligned}$$
(3)

where N is the total number of test images, H and W denote the height and width of the images, \(G_{n}(i,j)\) and \(M_{n}(i,j)\) are the ground truth mask and the predicted iris segmentation mask of the n-th image, and \(\oplus \) represents the XOR operation.
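The metric can be implemented directly, as in the following sketch where the masks are hypothetical boolean arrays.

```python
# Sketch of the ASE metric of Eq. (3): per-pixel XOR between ground-truth and
# predicted binary masks, averaged over all pixels of all N test images.
import numpy as np

def average_segmentation_error(gt_masks: np.ndarray, pred_masks: np.ndarray) -> float:
    """gt_masks, pred_masks: boolean arrays of shape (N, H, W)."""
    assert gt_masks.shape == pred_masks.shape
    return float(np.logical_xor(gt_masks, pred_masks).mean())

# Example with hypothetical masks:
gt = np.zeros((2, 360, 480), dtype=bool)
pred = gt.copy()
pred[:, :10, :10] = True                                # small false-positive patch
print(average_segmentation_error(gt, pred))             # ~0.00058, i.e. about 0.058%
```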

Baseline. Mask R-CNN, which is used as the backbone of the proposed method, serves as the baseline for iris segmentation. The convolution kernel size is \(3\times 3\) pixels with a padding of 1. The size of the mask feature map is \(14\times 14\) pixels (i.e. the size of the RoI is \(14\times 14\) pixels) after 4 convolution layers. The feature map is then upsampled to \(28\times 28\) pixels. The ASEs of the baseline on UBIRIS.v2 and CASIA.v4-Distance are 1.01% and 0.739%, respectively.

Table 1. Performance of feature maps with various scales. Results of RoI with the sizes \(14\times 14\), \(28\times 28\) and \(56\times 56\) before upsampling are demonstrated.

To investigate the influence of feature map size on iris segmentation, the performance of feature maps (i.e. RoIs) with sizes of \(14\times 14\), \(28\times 28\) and \(56\times 56\) pixels is evaluated; the feature maps are then upsampled to \(28\times 28\), \(56\times 56\) and \(112\times 112\) pixels respectively. Experimental results are shown in Table 1. They illustrate that the performance of different feature map sizes varies considerably, and that larger feature maps provide significant improvements in iris segmentation.

Multi-scale Fusion. Feature maps with different scales carry different information and therefore lead to different iris segmentation results. A small feature map captures the rough outer contour of the segmented region and is highly resistant to noise. As the size of the feature map increases, segmentation details improve, but incorrect segmentation results also appear due to noise.

To further enhance segmentation performance, multi-scale feature maps are fused as described in Subsect. 3.3. The fusion result of feature maps with sizes of \(14\times 14\), \(28\times 28\) and \(56\times 56\) pixels is shown in Table 2. \(\delta _{1}\), \(\delta _{2}\) and \(\delta _{3}\) in formula (1) are fixed to 1, so that feature maps with different scales contribute equally to segmentation. To illustrate the positive impact of smaller feature maps, the \(14\times 14\) feature map is removed and the model is fine-tuned; that result can also be found in Table 2. Segmentation results of fusing feature maps with different scales differ considerably: the fusion of three scales yields a lower ASE than the fusion of two. It can be inferred that the rough boundaries indicated by smaller feature maps help to resist noisy boundaries, while larger feature maps faithfully preserve detailed boundaries.

Table 2. Multi-scale fusion segmentation results. (28-56) and (14-28-56) indicate the scales of feature maps to be fused.

Comparison with Other Iris Segmentation Algorithms. To illustrate the superior performance of the proposed method, it is compared with published state-of-the-art iris segmentation methods. Results are shown in Table 3.

Table 3. Comparison of the proposed method and other iris segmentation methods. '-' indicates the result is not reported in the corresponding work.

The training sets of all the methods in the table are the same. It can be seen that the proposed method not only achieves the most accurate segmentation results, but also accelerates the iris detection and segmentation procedure by nearly 3 times. Some iris segmentation results on images from UBIRIS.v2 and CASIA.v4-Distance are shown in Fig. 3.

Fig. 3. Segmentation results of the proposed method on CASIA.v4-Distance (the first row) and UBIRIS.v2 (the second row). The red pixels indicate iris regions which are incorrectly predicted as non-iris regions, and the green pixels indicate non-iris regions which are incorrectly predicted as iris regions. 'iris' is the probability that there exists an iris in the bounding box.

5 Conclusions

In this paper, a novel iris detection and segmentation method named IDSN-MS is proposed. It utilizes the location information of the iris to provide more accurate segmentation features for iris segmentation. The fusion of multi-scale feature maps further improves segmentation performance by exploiting segmentation information at various scale levels. Finally, state-of-the-art iris segmentation results on the two challenging UBIRIS.v2 and CASIA.v4-Distance datasets demonstrate the superiority of the proposed method.