
1 Introduction

Object localization is an important task in industrial automatic production, for example, pick-and-place with an industrial robot or position localization in surface mount technology (SMT). Template matching methods are most often adopted for this task in machine vision applications: NCC-based template matching uses normalized cross-correlation as the feature [1], while shape-based template matching adopts edge features for object localization [2]. In many cases this kind of method performs well, but it is not robust against distortions such as occlusion, deformation, illumination change, motion blur, or extremely low contrast.

In recent years, CNN based object detection methods have achieved state-of-the-art results on classification tasks [3,4,5] and object detection tasks [6,7,8]. Current CNN based object detection methods mainly focus on general object detection in natural scenes. Because objects have different scales and aspect ratios, sliding-window based methods [9,10,11,12] incur a very high computational cost, so region-proposal based and regression based methods are adopted instead. The region-proposal approach was first proposed by Girshick et al. [6]; for acceleration, Fast R-CNN [13] by Girshick and Faster R-CNN [8] by Ren et al. were proposed. More recently, regression based models have become a new research hotspot [14,15], and faster methods such as YOLO [16] and the single shot multibox detector (SSD) [7] have been proposed.

There are three main challenges in applying current CNN techniques to object localization in industrial environments: (1) it is unrealistic to label a large amount of training data for a single industrial scene; (2) the method must be fast enough to keep up with high-throughput production lines; (3) it must be robust to the variety of products. Current state-of-the-art methods such as SSD [7] and Faster R-CNN [8] are fast, but they require a large amount of training data and training time, which makes them unsuitable for practical applications in industry.

In industrial automatic inspection, objects usually have a fixed scale and aspect ratio, so we only need to scan the whole image with a single fixed-size window, which keeps the computational cost acceptable. Moreover, this kind of method produces few false negatives. However, we find that using only a single model leaves many false alarms in the localization results.

For these reasons, we design a cascaded convolutional neural network (C-CNN) based method for object localization in industrial environments. The proposed C-CNN achieves rapid localization and remains robust even when trained on a small amount of data; our method runs at 14 FPS on a GTX 970. In the following sections, we present the overall framework of the proposed method and its details, then introduce our experiments and compare our method with traditional template matching and current state-of-the-art CNN based methods.

2 Cascaded CNN Detector

For object localization in industry, our object detector is shown in Fig. 1. Given a testing image, we first resize it to a small scale and use net-16 to densely scan the whole image, rejecting most of the background windows. Two further networks, net-32-1 and net-32-2, then reject the remaining background windows, and the windows that pass are accepted as rough detection results. Finally, we apply a relatively deep model, net-4, to adjust the passed windows. Non-maximum suppression (NMS) [17] is adopted to eliminate highly overlapped detection windows at the end of part 1 and part 2, as shown in Fig. 1.

Fig. 1. Testing pipeline of our method. We first detect in the downscaled image and finally adjust the windows in the original image.
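To make the cascade concrete, the following is a minimal Python sketch of the testing pipeline. The stage functions `scan_net16`, `eval_net32`, and `adjust_net4` are hypothetical stand-ins for the steps detailed in Sects. 2.1–2.3 and are passed in as callables (e.g. bound to model handles via `functools.partial`); the IoU and NMS helpers follow the standard greedy formulation rather than any implementation detail from the paper.

```python
import numpy as np

def iou(w1, w2):
    """IoU of two windows (cx, cy, a, a) given by center and side length."""
    (x1, y1, a1, _), (x2, y2, a2, _) = w1, w2
    ix = max(0.0, min(x1 + a1 / 2, x2 + a2 / 2) - max(x1 - a1 / 2, x2 - a2 / 2))
    iy = max(0.0, min(y1 + a1 / 2, y2 + a2 / 2) - max(y1 - a1 / 2, y2 - a2 / 2))
    inter = ix * iy
    return inter / (a1 * a1 + a2 * a2 - inter)

def nms(windows, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept windows."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = int(order[0])
        keep.append(i)
        rest = order[1:]
        overlaps = np.array([iou(windows[i], windows[int(j)]) for j in rest])
        order = rest[overlaps < iou_thresh]
    return keep

def detect(image, scan_net16, eval_net32, adjust_net4, T1=0.9, T2=0.85):
    # Part 1: net-16 scans the downscaled image, net-32-1/net-32-2
    # re-score the survivors, and NMS prunes overlapping rough windows.
    windows, scores = scan_net16(image, T1)
    windows, scores = eval_net32(image, windows, T2)
    keep = nms(windows, np.array(scores))
    # Part 2: net-4 shifts each rough window toward the object center.
    return [adjust_net4(image, windows[i]) for i in keep]
```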

2.1 The net-16

Training of the net-16. The structure of net-16 is shown in Fig. 2; it is a binary CNN classifier that separates objects from backgrounds. We select the Rectified Linear Unit (ReLU) as our activation function [18]; ReLU has been widely used in many works [3] and has been shown to improve the expressive ability of a network and to speed up convergence. The softmax loss is adopted as our cost function, and we apply weight decay to avoid overfitting.

Fig. 2. CNN structure of net-16.

We crop the object patches as positive examples, and regions whose intersection over union (IoU) with the objects is less than 40% are regarded as negative examples. Since negative examples greatly outnumber positive ones, we adopt rotation, Gaussian blur, and Gaussian noise for data augmentation, and we keep a positive-to-negative ratio of 1:4 during training.
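A possible implementation of this augmentation is sketched below; the rotation range, kernel sizes, and noise level are assumptions, not values from the paper, and a real pipeline would enforce the 1:4 positive-to-negative ratio when assembling mini-batches. (As Sect. 2.3 notes, rotated patterns are better cropped from rotated full images; rotating the patch directly, as here, is a simplification.)

```python
import cv2
import numpy as np

def augment(patch, rng=np.random):
    """Randomly rotate, blur, and add Gaussian noise to a training patch."""
    h, w = patch.shape[:2]
    # Rotate about the patch center by a random angle.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-180, 180), 1.0)
    out = cv2.warpAffine(patch, M, (w, h))
    # Gaussian blur with a random odd kernel size.
    k = int(rng.choice([3, 5, 7]))
    out = cv2.GaussianBlur(out, (k, k), 0)
    # Additive Gaussian noise, clipped back to the valid pixel range.
    noise = rng.normal(0.0, 5.0, out.shape)
    return np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```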

Testing of the net-16. Instead of densely scanning the whole image with a sliding window, we adopt a fully convolutional network, which eliminates redundant computation and achieves the same results. At testing time, the fully connected layer of net-16 is converted to a convolutional layer with a \(7\times 7 \) kernel.
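The conversion amounts to reshaping the fully connected weights into convolution kernels. A minimal PyTorch sketch, assuming the features entering the fully connected layer are \(C\) channels of \(7\times 7\) maps flattened in channel-first order (the paper only states the \(7\times 7\) kernel size):

```python
import torch
import torch.nn as nn

def fc_to_conv(fc: nn.Linear, in_channels: int) -> nn.Conv2d:
    """Turn a linear layer over flattened C x 7 x 7 features into an
    equivalent 7x7 convolution, so net-16 can slide over a full image."""
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size=7)
    with torch.no_grad():
        conv.weight.copy_(fc.weight.view(fc.out_features, in_channels, 7, 7))
        conv.bias.copy_(fc.bias)
    return conv
```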

For a \(W \times H\) testing image containing objects of size \(a \times a\), we first resize the image to \(w \times h\), where \(w = 16W/a\) and \(h = 16H/a\), so that an object maps to the \(16 \times 16\) input of net-16. The resized image is fed into the converted net-16 to obtain a confidence map. Each confidence value corresponds to one window in the original image, and windows with confidence less than the threshold \(T_1=0.9\) are rejected.
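Putting the resize and the confidence map together, a sketch of the scanning step follows. The resize factor \(16/a\) is our reading of the published formula, and the stride-2 back-mapping assumes a single \(2\times 2\) pooling layer in net-16; `net16_fcn` is the converted fully convolutional model from the sketch above and is assumed to return a 2-D NumPy confidence map.

```python
import cv2
import numpy as np

def scan_net16(image, a, net16_fcn, T1=0.9):
    """Run the fully convolutional net-16 and map confidences to windows."""
    H, W = image.shape[:2]
    w, h = int(round(16 * W / a)), int(round(16 * H / a))
    conf = net16_fcn(cv2.resize(image, (w, h)))  # 2-D confidence map
    windows, scores = [], []
    for i, j in zip(*np.nonzero(conf >= T1)):
        # A map cell (i, j) covers a 16x16 window at stride 2 in the
        # resized image; scale its center back by a factor of a / 16.
        windows.append(((2 * j + 8) * a / 16.0, (2 * i + 8) * a / 16.0, a, a))
        scores.append(float(conf[i, j]))
    return windows, scores
```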

2.2 The net-32

The net-32 stage is divided into two sub-networks, net-32-1 and net-32-2. Both are binary classifiers.

Training of the net-32. Our experiments show that a single network cannot reject all false detections, so we design a deep network, net-32-1, and a shallow network, net-32-2; their structures are shown in Fig. 3. The deep structure helps to extract more semantic information, while the shallow structure retains more details [19].

Fig. 3. CNN structures of the net-32-1 and net-32-2.

For the training of these two networks, we apply the trained net-16 to carry out hard negative mining: we use net-16 to scan the training images, and windows with confidence higher than \(T_1 \) but an IoU of less than 0.5 with the positive windows are regarded as negative examples.

Since the windows passed by net-16 are not accurate enough, we randomly sample windows that have the same size as the objects and an IoU of more than 70% with the ground truth as positive examples. We also employ the data augmentation strategy in this part.
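A hedged sketch of this training-data collection, reusing the `iou` helper from the pipeline sketch above; the jitter range and sample count per object are illustrative assumptions.

```python
import numpy as np

def mine_net32_examples(detections, scores, gt_windows, a, T1=0.9,
                        n_pos=50, rng=np.random):
    """Hard negatives from net-16 detections, jittered positives from GT."""
    negatives = [d for d, s in zip(detections, scores)
                 if s > T1 and max(iou(d, g) for g in gt_windows) < 0.5]
    positives = []
    for (gx, gy, _, _) in gt_windows:
        kept = 0
        while kept < n_pos:
            # Rejection-sample object-sized windows with IoU > 0.7.
            cand = (gx + rng.uniform(-a / 4, a / 4),
                    gy + rng.uniform(-a / 4, a / 4), a, a)
            if iou(cand, (gx, gy, a, a)) > 0.7:
                positives.append(cand)
                kept += 1
    return positives, negatives
```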

Testing of the net-32. We evaluate the windows passed by net-16, together with the 8 windows around each of them, using net-32. Given a passed window \((x, y, a, a)\) centered at \((x, y)\) with size \(a \times a\), and an original image of size \(W \times H\), the evaluated windows are \((x', y', a, a)\), where \(x' = x + r_x S_x\) and \(y' = y + r_y S_y\). Here \(r_x, r_y \in \{-0.75, 0, 0.75\}\), \(S_x = 2W/w\), and \(S_y = 2H/h\). Windows with confidence higher than the threshold \(T_2 = 0.85\) are regarded as detection windows.
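In code, the neighbourhood evaluation might look as follows. `score_fn` is a hypothetical stand-in for cropping a window and running it through net-32-1 and net-32-2, and the step sizes \(S_x = 2W/w\), \(S_y = 2H/h\) follow our reconstruction above.

```python
def eval_net32(image, passed_windows, score_fn, w, h, T2=0.85):
    """Re-score each passed window and its 8 shifted neighbours."""
    H, W = image.shape[:2]
    Sx, Sy = 2 * W / w, 2 * H / h  # one scanning step, in original pixels
    detections, scores = [], []
    for (x, y, a, _) in passed_windows:
        for rx in (-0.75, 0.0, 0.75):
            for ry in (-0.75, 0.0, 0.75):
                cand = (x + rx * Sx, y + ry * Sy, a, a)
                s = score_fn(image, cand)
                if s >= T2:
                    detections.append(cand)
                    scores.append(s)
    return detections, scores
```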

2.3 The net-4

The net-16 and net-32 are not accurate enough on their own, so we train another CNN, called net-4, to adjust the detected windows. The net-4 is a 9-class classifier over the object pattern and its eight surrounding patterns.

Training of the net-4. As the object region pattern and its surrounding patterns are similar to one another, we design a relatively large CNN for this task. For an object of size \(a \times a \), the input size of net-4 is \(a \times a \), and the outputs of net-4 are the confidences corresponding to these 9 regions. The structure of net-4 is shown in Fig. 4; it is similar to net-32-1, but wider, in order to capture more information.

Fig. 4. net-4 consists of three convolutional layers, two pooling layers, one fully connected layer, and a softmax classifier.

For an object window \((x, y, a, a)\), we crop the object window's patch and its 8 surrounding windows' patches as training data. These nine windows can be expressed as

$$\begin{aligned} (x+r_xa,y+r_ya,a,a), \end{aligned}$$
(1)

where \(r_x\in \{-0.15,0,0.15\} \) and \(r_y \in \{-0.15,0,0.15\} \). We also apply some data augmentation measures. It should be noted that rotated patterns should be cropped from the rotated images rather than produced by rotating the cropped patterns directly.
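Enumerating Eq. (1) is straightforward:

```python
def net4_training_windows(x, y, a):
    """The nine crops of Eq. (1): the object window plus 8 shifted ones."""
    offsets = (-0.15, 0.0, 0.15)
    return [(x + rx * a, y + ry * a, a, a)
            for rx in offsets for ry in offsets]
```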

Testing of net-4. The net-4 takes the passed windows as input, determines which of the nine patterns each window matches, and then adjusts the center coordinates of the detection window accordingly. The adjusting process is shown in Fig. 5.

Fig. 5. The adjusting pipeline for a passed window.
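A sketch of this adjustment, under the sign convention of Eq. (1): if net-4 says the crop matches the pattern at offset \((r_x a, r_y a)\), the object center lies at the opposite offset, so the window is shifted back accordingly. `classify` is a hypothetical stand-in for net-4 that returns the \((r_x, r_y)\) pair of the predicted class. The paper does not state whether the adjustment is applied once or repeatedly; one could also iterate this step until the predicted class is the centered pattern \((0, 0)\).

```python
def adjust_net4(image, window, classify):
    """Shift a detection window toward the object center using net-4."""
    x, y, a, _ = window
    rx, ry = classify(image, window)  # one of the 9 (rx, ry) classes
    return (x - rx * a, y - ry * a, a, a)
```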

3 Experiments and Analysis

In this section, we evaluate the performance of the proposed approach and compare it with other methods. We adopt the precision rate (\(P_{rate}\)) and the recall rate (\(R_{rate}\)) to measure the performance of these algorithms; \(P_{rate}\) and \(R_{rate}\) are defined in Eq. (2).

$$\begin{aligned} \left\{ \begin{array}{l} P_{rate} = \frac{TP}{TP+FP} \\ R_{rate} = \frac{TP}{TP+FN} \\ \end{array} \right. \end{aligned}$$
(2)

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
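Eq. (2) translates directly into code:

```python
def precision_recall(tp, fp, fn):
    """P_rate and R_rate as defined in Eq. (2)."""
    return tp / (tp + fp), tp / (tp + fn)
```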

3.1 Experimental Results

To evaluate the proposed algorithm, we test our method on several image sets, each composed of 100–200 images with a resolution of \(640\times 480\). Figure 6 shows some sample images from the testing image data sets. There is a lot of interference, the backgrounds are very complex, and the targets are small, blurred, and deformed, all of which make the detection task very difficult.

Fig. 6. Some sample images from the testing image data sets: semi images (top left), flower images (top right), candy images (bottom left), and screw images (bottom right).

We picked out and annotated only about 20–30 images from each set as training data; the testing results are shown in Table 1. Our method remains robust even though the training set contains only 20–30 images: every \(P_{rate}\) and \(R_{rate}\) in our experiments is higher than 97%.

Table 1. Results of our proposed method

3.2 Comparison Results and Analysis

We compare our method with the shape-based template matching in MVTec Halcon [21] (from a Germany-based machine vision company) and with SSD [7]. The template matching algorithm needs only one template image; for SSD [7], we again pick out about 20–30 images as training data. We compare these three methods on the different image sets. The results are similar across sets, so we only show the results on the candy images in Table 2. In the candy images, the candy wrapping paper is reflective, deformable, distorted, and of low contrast against the background. The results are shown in Fig. 7.

Table 2. Results of template matching and SSD
Fig. 7. Object localization results of our proposed method (top), template matching (middle) and SSD (bottom). (Color figure online)

The results show that our method performs better than template matching and SSD (yellow boxes in Fig. 7). When the objects exhibit large visual variations and their shape features are not obvious, template matching nearly fails to detect them. Although our method uses four CNN models, we apply multi-scale detection and a fully convolutional network to accelerate the algorithm. Since the detector is CNN based, it is easy to parallelize on a GPU: on a moderate GPU card, a GTX 970, our method achieves about 14 FPS, which is comparable to traditional template matching methods. The runtime comparison is given in Table 3.

We also compare our method with the commercial software ViDi Suite [22], a deep learning based industrial image analysis package developed by a Swiss software firm to solve industrial vision challenges. We again provide 20–30 images as training data for ViDi. The results are shown in Table 4 and Fig. 8.

Table 3. Runtime comparison

As shown in Table 4 and Fig. 8, ViDi has a higher \(P_{rate}\), but our method obtains a better recall rate on three of the four image sets; in particular, ViDi reaches only a 75% \(R_{rate}\) on the candy images. Since the background is usually stationary in industrial machine vision applications, object localization here differs from object detection in the wild, and there is no publicly available dataset for comparison.

Table 4. Results of ViDi object localization method
Fig. 8. The first row shows the results of the proposed method and the second row shows the localization results of ViDi.

4 Conclusion

Object localization is an important task in industrial machine vision applications. Traditional template matching methods completely fail to detect objects in some extreme cases, and current CNN based methods focus on general object detection in natural scenes. We propose a cascaded CNN detector, C-CNN, specifically for object localization in industrial scenes. Our experiments show that the C-CNN method is robust and can locate objects in images of extremely poor quality, outperforming both traditional methods and state-of-the-art CNN methods with a small number of training images. Furthermore, our method runs in real time on a moderate GPU, so it can be utilized in practical machine vision systems.