
1 Introduction

Building a reliable pork information tracking system is an important way to ensure that residents can buy safe pork [1]. First, the public can confirm the safety of purchased pork through the traceability system. Second, once pork on the market is found to have a food safety problem, inspection and quarantine personnel can quickly access its production and circulation records through the traceability system and locate the stage at which the problem occurred. The development of pork traceability technology not only enhances the safety of residents' daily diet but also regulates the pork production and sales industry, benefiting the companies involved [2].

At present, many countries use a stamp on the pig skin as the method of traceability, as shown in Fig. 1, but this method has many limitations. First, the stamp label may differ across regions or companies. Second, the clarity of the stamp cannot be guaranteed when the pork is stamped. In addition, the stamp may become worn and illegible during circulation. Because of these problems, the information carried by the stamp is often ineffective, and this traditional method of traceability is unreliable. This study uses image registration to record images of the features of the pork skin itself, overcoming the above problems and achieving effective and reliable pork traceability.

Fig. 1. Stamps on the pork skin

Image registration is the process of finding the transformation that best aligns two images, that is, mapping different datasets into the same spatial coordinate system. The data may come from different viewpoints or from different kinds of sensors. The technique is widely applied in computer vision, for example to plant leaf detection [3, 4], medical image analysis [5, 6], image repair [7], and satellite imagery [8].

There are two kinds of image matching methods. The first kind is based on neighborhood information, such as mutual information [9], cross correlation [10], and the FFT-based phase correlation method [11]. The mutual information method is often used for registering medical and remote sensing images because of its high precision; however, its slow computation has limited its further development in high-precision image registration. The FFT-based phase correlation method has good scale and rotation invariance, but it is sensitive to noise, and there must be a significant linear relationship between the registered images [12]. The second kind is based on feature extraction, where the most common approach is to extract point features, as in the Scale Invariant Feature Transform (SIFT) [13]. Besides point features, line features and surface features can also be extracted, using edge detection algorithms [14] and image segmentation algorithms [15] respectively.

Methods based on neighborhood information (area-based methods) use image intensity directly for registration. Feature-based approaches can represent higher-level information, so their descriptors are preferable in applications where the image intensity varies greatly. Because the algorithm proposed in this paper is a registration method based on feature extraction, we focus our discussion on the second kind.

SIFT and its derivatives are chosen in most feature-based image registration methods. SURF [16] uses integral images instead of convolution operations, and because its descriptor is only 64-dimensional, it is faster than SIFT; however, it performs poorly on images with large scale changes. To overcome this problem, Morel et al. presented the affine-invariant ASIFT algorithm [17]. To address the excessive dimensionality of the SIFT descriptor, Liu drew on the idea of PCA-SIFT [18] and proposed using Kernel Independent Component Analysis [19] for dimensionality reduction, a method called KICA-SIFT [20]. However, in some multi-temporal or multi-sensor registration tasks with visual differences, the feature points selected by the methods above may include outliers; in more extreme cases, SIFT-like algorithms cannot select enough points at all [21]. These problems limit SIFT in the registration process. There are also many point set registration methods; the classic approach estimates the degree of registration through probabilistic optimization with a Gaussian mixture model (GMM), as in Coherent Point Drift (CPD) [22].

In this paper, we adopt a dynamic pork skin image registration algorithm, illustrated in Fig. 2. The method has two prominent features: (i) the VGG network [24], pre-trained on ImageNet [23], is used to generate multi-scale feature descriptors; the features obtained by its convolution operations capture stable high-dimensional information about the image. (ii) A feature point set matching method with preference is used. Rather than making a hard distinction between inliers and redundant points, it gradually increases the number of inliers during matching. In the first stage of registration, pre-matching is performed by selecting high-confidence points; additional feature points are then introduced to refine the registration result while limiting the impact of mismatches. Registration quality is assessed using both convolutional features and neighborhood information.

Fig. 2. The flow of our algorithm

We compared the proposed feature extraction method with SIFT on a multi-angle pork skin dataset, and compared dynamic inlier selection with three other registration methods.

In the following sections, we first present our registration algorithm based on a convolutional architecture, including how a convolutional neural network is used as a feature extractor for high-dimensional features of the pig skin dataset, and the specific implementation of dynamic inlier selection. We then compare these methods and analyze our experimental results. Finally, we present conclusions and future work.

2 Method

In this study, an image registration algorithm originally designed for remote sensing imagery is applied to the pig skin image collection. We first use the pre-trained VGG-16 model to extract image features without any screening. To obtain rough registration parameters, we select relatively reliable points for a pre-matching step. In the registration stage, we propose a point set registration algorithm based on a Gaussian mixture model to achieve accurate registration. Finally, the TPS (thin plate spline [25]) method is introduced.

2.1 Feature Extraction Using the Pre-trained VGG-16 Model

Our feature descriptors are obtained from the VGG-16 pre-trained model, a 1000-class image classification network. Compared with other models, the VGG-16 pre-trained model has these advantages: (a) except for very few kernels, all convolution kernels are 3 × 3, so each kernel has a small receptive field, which suits images with pronounced local features such as pig skin images; (b) the model structure is relatively simple, making it easy to modify or to use only some of its modules; (c) its excellent performance in image classification shows that the model has strong feature extraction ability, and its convolution kernels extract common features with strong generalization ability.

Because we use the convolutional neural network only for feature extraction, not for tasks such as image classification, we discard the fully connected layers. The input size can therefore be any multiple of 32. However, different input sizes give each kernel a different receptive field, and excessively large inputs seriously increase the computation. Moreover, our model takes two images as input; if their sizes differ, feature extraction performance is seriously affected. Considering that a unified receptive field also greatly reduces computation, we feed fixed-size images (224 × 224) to the network.
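As an illustration of this constraint, the following sketch builds the truncated network and checks the input sizes; PyTorch/torchvision is an assumption here, as the paper does not name a framework.

```python
# Minimal sketch (torchvision assumed). Dropping the fully connected layers
# leaves five 2x2 max-poolings, so the spatial size is divided by 32: inputs
# whose sides are multiples of 32 map to whole-numbered feature grids.
import torch
from torchvision import models

conv_part = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
with torch.no_grad():
    for side in (224, 256, 320):                  # all multiples of 32
        y = conv_part(torch.randn(1, 3, side, side))
        print(side, "->", tuple(y.shape))         # e.g. 224 -> (1, 512, 7, 7)
```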

Using a single network layer as the output cannot guarantee that effective features are extracted under all conditions. To solve this problem, our feature descriptor is built from multiple layers. After feeding an image of random values into VGG-16, the convolution kernels of each layer can be visualized by gradient ascent. Shallow layers attend to detailed features, while the outputs of deep layers are influenced by the specific category and are therefore unsuitable as the sole feature descriptor. We thus choose the \( {\text{pool}}_{3} \) and \( {\text{pool}}_{4} \) layer outputs to build our feature descriptors (Fig. 3).

Fig. 3. Part of the VGG-16 network structure

VGG-16 has 5 convolutional blocks, each containing multiple convolution operations. We use the pooling layers after the third and fourth blocks for feature extraction. The \( {\text{pool}}_{3} \) layer output has size 28 × 28 × 256. We divide the input image into 28 × 28 cells of 8 × 8 pixels each. Each 8 × 8 cell maps to one 256-dimensional vector of the \( {\text{pool}}_{3} \) output, so each cell center serves as a feature point whose descriptor is that 256-dimensional vector, and the \( {\text{pool}}_{3} \) feature map \( M_{1} \) is the output of \( {\text{pool}}_{3} \) itself.

The \( {\text{pool}}_{4} \) layer output has size 14 × 14 × 512: each 16 × 16 pixel patch of the input image maps to one 512-dimensional vector. The \( {\text{pool}}_{4} \) output cannot form features directly, because each 16 × 16 patch contains 4 feature points, so its 512-dimensional vector must be shared by those 4 points. Applying the Kronecker product (Eq. 1) to the \( {\text{pool}}_{4} \) output therefore forms \( M_{2} \). The feature extraction scheme is shown in Fig. 4.

$$ M_{2} = O_{{pool_{4} }} \otimes I_{2 \times 2 \times 1} $$
(1)
Fig. 4. The receptive field corresponding to different pooling layers. Green dots correspond to the output of the \( {\text{pool}}_{3} \) layer, and blue dots correspond to the output of the \( {\text{pool}}_{4} \) layer (Color figure online)

The obtained feature maps \( M_{1} \), \( M_{2} \) should be normalized before proceeding to the next step as shown in Eq. 2.

$$ M_{i} \leftarrow \frac{{M_{i} }}{{\sigma \left( {M_{i} } \right)}}, {\text{i}} = 1,2 $$
(2)

\( \sigma \left( {M_{i} } \right) \) denotes the standard deviation of \( M_{i} \). The feature point set of input image \( I_{a} \) is denoted as \( S_{a} \), and the point at index \( x \) in \( S_{a} \) is denoted as \( a_{x} \). The features of point \( a_{x} \) in maps \( M_{1} \) and \( M_{2} \) are denoted as \( F_{1} \left( x \right) \) and \( F_{2} \left( x \right) \) respectively.
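A compact sketch of this descriptor construction (Eqs. 1-2) is given below; the layer indices into torchvision's vgg16 are assumptions based on its standard layout, with features[16] as the third max-pooling and features[23] as the fourth.

```python
# Sketch of descriptor construction from pool3/pool4 (Eqs. 1-2).
# Layer indices are assumptions for torchvision's vgg16 layout.
import numpy as np
import torch
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

def descriptors(img):                        # img: (1, 3, 224, 224) tensor
    feats = {}
    with torch.no_grad():
        x = img
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i == 16: feats[3] = x         # pool3: (1, 256, 28, 28)
            if i == 23: feats[4] = x         # pool4: (1, 512, 14, 14)
    M1 = feats[3].squeeze(0).numpy()                       # 256 x 28 x 28
    # Kronecker product with a 2x2 block of ones (Eq. 1) spreads each pool4
    # vector over the four 8x8 cells it covers, giving a 512 x 28 x 28 map.
    M2 = np.kron(feats[4].squeeze(0).numpy(), np.ones((2, 2)))
    # Normalize each map by its standard deviation (Eq. 2).
    return M1 / M1.std(), M2 / M2.std()
```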

2.2 Feature Pre-Matching

To reduce computation and to compare our feature descriptors with other feature extraction algorithms, a pre-matching step is introduced. As shown in Eq. 3, the feature distance between points \( a_{x} \) and \( b_{y} \) is defined as the weighted sum of the distances computed from the \( {\text{pool}}_{3} \) and \( {\text{pool}}_{4} \) features.

$$ d\left( {x,y} \right) = \lambda d_{1} \left( {x,y} \right) + d_{2} \left( {x,y} \right) $$
(3)

The coefficient λ is set to \( \sqrt 2 \): the \( {\text{pool}}_{3} \) feature is a 256-dimensional vector while the \( {\text{pool}}_{4} \) feature is 512-dimensional, and since the expected Euclidean norm grows with the square root of the dimension, \( \lambda = \sqrt {512/256} = \sqrt 2 \) balances the two terms. The Euclidean distance is used to measure the similarity between features, as shown in Eq. 4.

$$ d_{i} \left( {x,y} \right) = \left\| {F_{i} \left( x \right) - F_{i} \left( y \right)} \right\|, i = 1,2 $$
(4)

We assume that the feature point \( a_{x} \) matches \( b_{y} \) if they satisfy the following condition:

$$ d\left( {x,y} \right) = \mathop {\min }\limits_{{a_{{x_{0} }} \in S_{a} }} d\left( {x_{0} ,y} \right)\; \wedge \;\nexists a_{{x_{1} }} \in S_{a} \backslash \left\{ {a_{x} } \right\}:\;d\left( {x_{1} ,y} \right) < \theta \cdot d\left( {x,y} \right) $$
(5)

The parameter \( \theta \) controls the fault tolerance of the pre-matching stage: the larger \( \theta \) is, the more reliable the match between \( a_{x} \) and \( b_{y} \). However, in this pre-matching stage we need a large number of feature points for a rough registration, so a smaller threshold \( \theta_{0} \) is chosen, selecting the 128 most reliable pairs of feature points for pre-matching.
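A sketch of this rule (Eqs. 3-5) is shown below; the descriptor arrays and their shapes are assumptions following Sect. 2.1.

```python
# Sketch of pre-matching (Eqs. 3-5). Fa1/Fb1: (N, 256) pool3 descriptors of
# images a and b; Fa2/Fb2: (N, 512) pool4 descriptors (names are assumptions).
import numpy as np

def feature_distance(Fa1, Fa2, Fb1, Fb2, lam=np.sqrt(2)):
    """Weighted distance matrix of Eq. 3 between all point pairs."""
    d1 = np.linalg.norm(Fa1[:, None, :] - Fb1[None, :, :], axis=2)
    d2 = np.linalg.norm(Fa2[:, None, :] - Fb2[None, :, :], axis=2)
    return lam * d1 + d2

def prematch(D, theta):
    """Pairs (x, y) satisfying Eq. 5: a_x is the nearest neighbour of b_y,
    and no rival point comes within theta times that best distance."""
    pairs = []
    for y in range(D.shape[1]):
        order = np.argsort(D[:, y])
        best, runner_up = order[0], order[1]
        if D[runner_up, y] >= theta * D[best, y]:
            pairs.append((best, y))
    return pairs
```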

2.3 Dynamic Inlier Selection

After this rough processing, we perform fine registration. Unlike the randomly located points produced by feature extraction algorithms such as SIFT, our feature points are evenly distributed at the centers of 8 × 8 pixel cells, and the similarity between pixel patches varies. We therefore designed a dynamic inlier selection method. For pixel patches with a high degree of overlap, we consider the matching of their corresponding feature points more reliable, so they should play a greater role in the overall matching. Pixel patches with lower overlap should have a smaller effect on the registration process, because they may be errors or outliers; even when they are not, low overlap means their features are not distinctive, so their influence on the registration should still be reduced.

Convolutional feature values and spatial neighborhood information are considered simultaneously when judging whether a pair of feature points match. The probability that \( a_{x} \) matches \( b_{y} \) is denoted as \( P_{R} \left[ {x,y} \right] \). The prior probability matrix \( P_{R} \), derived from convolutional and spatial features, is obtained by the following steps:

Calculate the Convolution Feature Distance.

As in the pre-matching stage, when point \( a_{x} \) matches \( b_{y} \), the Euclidean distance between their feature values should be small, and large when they do not match. To make distances comparable and to support further processing, they are normalized; the resulting convolutional cost matrix is shown in Eq. 6.

$$ C_{\theta }^{conv} [x,y] = \left\{ {\begin{array}{*{20}c} {\tfrac{{d\left( {x,y} \right)}}{{d_{\theta }^{max} }},} & {condition\;1} \\ {1,} & {otherwise} \\ \end{array} } \right. $$
(6)

\( d_{\theta }^{max} \) is the maximum of all feature distances among the feature pairs selected with \( \theta \). Condition 1 means that \( a_{x} \) and \( b_{y} \) satisfy Eq. 5.
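In code, Eq. 6 reduces to a few lines (a sketch under the same assumed names as above):

```python
# Sketch of the convolutional cost matrix (Eq. 6): selected pairs get their
# normalized feature distance, all other entries cost 1.
import numpy as np

def conv_cost(D, pairs):
    """D: feature-distance matrix; pairs: matches selected with theta (Eq. 5)."""
    C = np.ones_like(D)
    d_max = max(D[x, y] for x, y in pairs)    # d_theta^max
    for x, y in pairs:
        C[x, y] = D[x, y] / d_max
    return C
```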

Calculate Neighborhood Information.

Unlike the pre-matching phase, which selects feature points using only the convolutional feature distance, the fine matching stage also considers neighborhood information. The shape context algorithm [26] describes the neighborhood of a point with a histogram-based feature descriptor. Taking each feature point as the center, a polar coordinate system is established; after the polar plane is divided into several arc-shaped regions, the descriptor of each point is derived from the distribution of the surrounding points. The resulting neighborhood cost matrix is shown in Eq. 7.

$$ C_{\theta }^{geo} \left[ {x,y} \right] = \frac{1}{2}\sum\nolimits_{t = 1}^{B} {\frac{{\left[ {h_{a}^{x} \left( t \right) - h_{b}^{y} \left( t \right)} \right]^{2} }}{{h_{a}^{x} \left( t \right) + h_{b}^{y} \left( t \right)}}} $$
(7)

\( h_{a}^{x} \left( t \right) \) denotes the number of points falling in the \( t^{th} \) region centered on \( a_{x} \), and \( B \) is the number of regions.
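The sketch below illustrates one possible shape-context construction and the χ²-style cost of Eq. 7; the bin counts (5 radial × 12 angular) are conventional choices, not taken from the paper.

```python
# Sketch of shape-context descriptors and the neighborhood cost (Eq. 7).
import numpy as np

def shape_context(pts, n_r=5, n_t=12):
    """pts: (N, 2) feature point coordinates. Returns (N, n_r * n_t) histograms."""
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]
    r = np.linalg.norm(diff, axis=2)
    ang = np.arctan2(diff[..., 1], diff[..., 0])
    r_edges = np.logspace(np.log10(r[r > 0].min()), np.log10(r.max()), n_r + 1)
    hists = np.zeros((n, n_r * n_t))
    for i in range(n):
        others = np.arange(n) != i
        rb = np.clip(np.searchsorted(r_edges, r[i, others]) - 1, 0, n_r - 1)
        tb = ((ang[i, others] + np.pi) / (2 * np.pi) * n_t).astype(int) % n_t
        np.add.at(hists[i], rb * n_t + tb, 1)     # count points per region
    return hists

def geo_cost(Ha, Hb, eps=1e-9):
    """Chi-square cost of Eq. 7 between all histogram pairs."""
    num = (Ha[:, None, :] - Hb[None, :, :]) ** 2
    den = Ha[:, None, :] + Hb[None, :, :] + eps
    return 0.5 * np.sum(num / den, axis=2)
```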

Calculate the Overall Cost Matrix.

Combine \( C_{\theta }^{conv} \) and \( C_{\theta }^{geo} \) using the Hadamard product, as shown in Eq. 8.

$$ C = C_{\theta }^{conv} \odot C_{\theta }^{geo} $$
(8)

Calculate the Prior Probability Matrix.

The purpose of dynamic inlier selection is to pick the more reliable pairs among the large number of feature points obtained by feature extraction. The LAPJV algorithm [27] is used to decide whether \( a_{x} \) and \( b_{y} \) match. The prior probability matrix is then given by Eq. 9.

$$ P_{R} \left[ {x,y} \right] = \left\{ {\begin{array}{*{20}c} {1,} & {a_{x} \;matches\;b_{y} } \\ {\frac{{1 - \epsilon }}{N},} & {otherwise} \\ \end{array} } \right. $$
(9)

\( \epsilon \), whose value lies between 0 and 1, represents the degree of trust we place in the inliers selected by the above steps: the larger \( \epsilon \), the more we trust the selected feature points. \( N \) is the number of points in set \( S_{a} \).
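Putting Eqs. 8-9 together (a sketch; SciPy's linear_sum_assignment, a Jonker-Volgenant variant, stands in for LAPJV here):

```python
# Sketch of Eqs. 8-9: Hadamard product of the two costs, then an optimal
# assignment defines the matched pairs of the prior probability matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

def prior_matrix(C_conv, C_geo, epsilon=0.9):
    C = C_conv * C_geo                          # Hadamard product (Eq. 8)
    rows, cols = linear_sum_assignment(C)       # minimum-cost matching
    N = C.shape[0]
    P_R = np.full_like(C, (1 - epsilon) / N)    # non-matches (Eq. 9)
    P_R[rows, cols] = 1.0                       # matched pairs
    return P_R
```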

At the beginning of the selection, we choose a larger initial threshold \( \theta_{1} \) to select the 64 most reliable points. In the subsequent registration iterations, inliers are reselected every \( k \) iterations; after each reselection, \( \theta \) is decreased by the step \( \delta \) given in Eq. 10, so that more points are used for registration each time. Because the most credible points are used earliest, errors and outliers cannot have a fatal effect on the registration: the most reliable points drive the overall registration, while the points added later refine it.

$$ \delta = \frac{{\theta_{1} - \theta_{0} }}{10} $$
(10)
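The schedule can be written as a short loop (a sketch; the initial values and the reselection interval k are illustrative, except that Eq. 10 fixes ten steps between θ₁ and θ₀):

```python
# Sketch of the dynamic threshold schedule (Eq. 10): theta is lowered from
# theta_1 to theta_0 in ten steps, reselecting inliers every k iterations.
theta_1, theta_0, k = 1.8, 1.2, 5        # illustrative values, not from the paper
delta = (theta_1 - theta_0) / 10         # Eq. 10

theta = theta_1
for it in range(100):
    if it % k == 0 and theta - delta >= theta_0 - 1e-12:
        theta -= delta                   # loosen: admit more feature points
        # inliers = reselect(theta)     # rerun the selection of Sect. 2.3
print(round(theta, 2))                   # reaches theta_0
```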

2.4 Gaussian Mixture Model

A GMM describes the probability density of samples as a weighted sum of Gaussian components. Let the points in the point set \( P_{a} \) be the observations generated by the GMM, whose centers are determined by the points in \( P_{b} \); the GMM probability density function is then given by Eq. 11.

$$ p\left( {a_{x} } \right) = \sum\nolimits_{m = 1}^{M + 1} {p\left( {a_{x} |m} \right)p\left( m \right)} $$
(11)

\( p\left( m \right) \), which equals 1/M, is the proportion of the \( m^{th} \) Gaussian component in the GMM, and \( p\left( {a_{x} |m} \right) \) is the \( m^{th} \) Gaussian model. We also add a uniform distribution \( p\left( {a_{x} |M + 1} \right) \) to the mixture to handle noise and outliers. Let the weight of the uniform distribution be \( \omega \); the density can then be rewritten as Eq. 12.

$$ p\left( x \right) = \omega \frac{1}{N} + \left( {1 - \omega } \right)\sum\nolimits_{m = 1}^{M} {\frac{1}{M}\frac{1}{{2\pi \sigma^{2} }}e^{{ - \frac{{\left\| {x - z_{m} } \right\|^{2} }}{{2\sigma^{2} }}}} } $$
(12)

Unlike a traditional GMM, the centers \( z \) of our Gaussian components \( p\left( {a_{x} |m} \right) \) are determined by \( P_{b} \), a Gaussian radial basis function matrix \( G \), and the parameter matrix \( W \). The transformation is defined in Eq. 13.

$$ Z = P_{b} + GW $$
(13)

The parameters \( \left( {\sigma^{2} ,\omega ,W} \right) \) of the GMM are obtained by maximum likelihood estimation, equivalently by minimizing the negative log-likelihood function in Eq. 14.

$$ L\left( {\sigma^{2} ,\omega ,W} \right) = - \sum\nolimits_{n = 1}^{N} {\log } \sum\nolimits_{m = 1}^{M + 1} {P_{R} \left[ {x,y} \right]p\left( {a_{x} |m} \right)} $$
(14)
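The density of Eqs. 12-13 can be sketched as follows (the kernel width beta is an assumed parameter, not given in the paper):

```python
# Sketch of the mixture density (Eqs. 12-13): M Gaussians centred on the
# transformed points Z = P_b + G W, plus a uniform outlier component.
import numpy as np

def gaussian_kernel(P, beta=2.0):
    """Radial basis matrix G of Eq. 13; beta is an assumed kernel width."""
    d2 = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2 * beta ** 2))

def gmm_density(x, P_b, W, sigma2, omega, N, beta=2.0):
    """Eq. 12 evaluated at a 2-D point x, with p(m) = 1/M."""
    Z = P_b + gaussian_kernel(P_b, beta) @ W            # centres (Eq. 13)
    g = np.exp(-np.sum((x - Z) ** 2, axis=1) / (2 * sigma2)) / (2 * np.pi * sigma2)
    return omega / N + (1 - omega) * g.mean()
```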

2.5 Registration Using the EM Algorithm

We apply the EM algorithm [28] to estimate the parameters \( \left( {\sigma^{2} ,\omega ,W} \right) \) of the negative log-likelihood function. The main idea of EM is to compute the posterior probability of each Gaussian component using Bayes' theorem and the current parameter values. After expanding Eq. 14 and dropping terms independent of the parameters, it can be written as Eq. 15.

$$ Q\left( {\sigma^{2} ,\omega ,W} \right) = \frac{1}{{2\sigma^{2} }}\sum\nolimits_{n = 1}^{N} {\sum\nolimits_{m = 1}^{M} {P_{O} \left[ {x,y} \right]\left\| {a_{x} - z_{y} } \right\|^{2} } } + N_{p} \;\log \left( {\frac{{\sigma^{2} \omega }}{1 - \omega }} \right) - N\;\log \omega $$
(15)

where \( N_{p} = \mathop \sum \limits_{n = 1}^{N} \mathop \sum \limits_{m = 1}^{M} p^{old} \left( {m|a_{x} } \right) \). Equation 15 is an upper bound of the negative log-likelihood in Eq. 14. The EM algorithm alternates the E step and the M step so that Eq. 15 successively approaches the minimum of Eq. 14.

E Step.

Calculate the posterior probability using the Bayesian formula and the parameters obtained in the previous iteration.

$$ P_{O} \left[ {x, y} \right] = p^{old} \left( {m|a_{x} } \right) = \frac{{P_{R} \left[ {x, y} \right]p\left( {a_{x} |m} \right)}}{{p\left( {a_{x} } \right)}} $$
(16)

M Step.

Update the parameters for the next iteration.

$$ \sigma^{2} : = \frac{1}{{2N_{P} }}\sum\nolimits_{n = 1}^{N} {\sum\nolimits_{m = 1}^{M} {p\left( {m|a_{x} } \right)\left\| {a_{x} - z_{y} } \right\|^{2} } } $$
(17)
$$ \omega : = 1 - \frac{1}{N}\sum\nolimits_{n = 1}^{N} {} \sum\nolimits_{m = 1}^{M} {p\left( {m|a_{x} } \right)} $$
(18)
$$ W: = \left( {G + \lambda \sigma^{2} P_{d}^{ - 1} } \right)^{ - 1} \cdot \left( {P_{d}^{ - 1} P_{O} X - Y} \right) $$
(19)

where \( P_{d} = diag\left( {P_{O} \cdot \alpha } \right) \) and \( \alpha \) is an N-dimensional column vector of ones, \( \left[ {1,1, \ldots ,1} \right]^{T} \).
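A sketch of one EM round (Eqs. 16-19), written in the style of the CPD updates, follows; the uniform-component constant in the E-step denominator is simplified, and lam (the regularization weight) is an assumed parameter.

```python
# Sketch of one EM round (Eqs. 16-19). X: (N, 2) observed points, Y: (M, 2)
# template points, G: (M, M) kernel of Eq. 13, P_R: (M, N) prior of Eq. 9.
import numpy as np

def em_step(X, Y, G, W, P_R, sigma2, omega, lam=2.0):
    M, N = len(Y), len(X)
    Z = Y + G @ W                                         # centres (Eq. 13)
    d2 = np.sum((Z[:, None, :] - X[None, :, :]) ** 2, axis=2)
    gauss = np.exp(-d2 / (2 * sigma2)) / (2 * np.pi * sigma2)
    num = P_R * gauss
    P_O = num / (num.sum(axis=0, keepdims=True) + omega / N)   # E step (Eq. 16)

    # M step (Eqs. 17-19)
    P1 = P_O.sum(axis=1)                                  # P_O . alpha
    Np = P1.sum()
    Pd_inv = np.diag(1.0 / np.maximum(P1, 1e-12))
    W = np.linalg.solve(G + lam * sigma2 * Pd_inv, Pd_inv @ P_O @ X - Y)  # Eq. 19
    Z = Y + G @ W
    d2 = np.sum((Z[:, None, :] - X[None, :, :]) ** 2, axis=2)
    sigma2 = float((P_O * d2).sum() / (2 * Np))           # Eq. 17
    omega = 1.0 - Np / N                                  # Eq. 18
    return W, sigma2, omega
```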

3 Experiment and Result Analysis

Our work was verified on the pork skin dataset. We compared the proposed feature extraction algorithm with SIFT, and compared our registration algorithm with CPD [22], ICP [29], and KNN [30].

3.1 Experiment Design

Feature Pre-Matching Test.

Since feature pre-matching is the key intermediate step of this method, we compare the convolutional features used here with traditional SIFT features. We test the two methods by extracting features from the same dataset and matching them. We then select the 100 most reliable pairs of pre-matched results and use them to compute the precision, Precision = TP/(TP + FP). The number of pre-matched results can be controlled by adjusting the parameter θ.

Image Registration Accuracy Test.

In this test, we mark 15 coordinate points in each pair of corresponding images. The test compares the coordinates of the marked points in the original image with those in the image after registration, and measures the distances between corresponding point pairs. We use the median of the distance (MoD), the standard deviation of the distance (SDoD), the root mean square distance (RMSD), and the mean absolute distance (MAD) as error measures (Fig. 5).
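The four measures reduce to a few lines (a sketch; array names are assumptions):

```python
# Sketch of the four error measures over the 15 marked point pairs.
import numpy as np

def registration_errors(p_ref, p_reg):
    """p_ref, p_reg: (15, 2) marked and registered coordinates."""
    d = np.linalg.norm(p_ref - p_reg, axis=1)     # point-pair distances
    return {"MoD":  np.median(d),                 # median of distance
            "SDoD": d.std(),                      # standard deviation of distance
            "RMSD": np.sqrt(np.mean(d ** 2)),     # root mean square distance
            "MAD":  np.mean(d)}                   # mean absolute distance
```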

Fig. 5. Local pig skin image collection method and example.

Dataset.

Images of fresh pork skin were collected indoors. Each piece of pig skin was about 28 cm long and 21 cm wide. The illumination was scattered fluorescent light with an intensity of about 100 lx, and the sensor (a Sony IMX 298) was about 10 cm from the center of the pork skin. Each piece of fresh pork skin was divided into a 3 × 3 grid of 9 areas, and each area was photographed at rotations of 0°, 60°, and 120° about its center. In this way, data were collected from 30 different pieces of fresh skinned pork, yielding a total of 810 pork skin images for the experiments.

3.2 Results Analysis

Pre-Matching Test Results.

The precision of feature pre-matching on our dataset using CNN features and traditional SIFT features is shown in Table 1. To illustrate the positions of the feature points in the pre-matching phase and the assignment of correct and erroneous points, three representative matching results are shown in Fig. 6. In addition, the proposed algorithm guarantees one feature point per square region (its center), so the feature points are distributed more evenly over the overlapping patch.

Table 1. Feature pre-matching precision test result
Fig. 6. Results of the feature pre-matching precision test

Fig. 7. Results of the registration test

Registration Accuracy Test Results.

The registration accuracy values and comparisons are shown in Table 2. The experimental results show that the proposed method performs best; in particular, when the image displacement or rotation is small, our algorithm registers each key point accurately. Although KNN is not sensitive to outliers, it is computationally expensive and slow. When the scale of the image changes greatly, the accuracy of the CPD method decreases. In addition, as shown in Fig. 8, our algorithm performs best when the image is rotated in steps of \( 10^\circ \) (Fig. 7).

Table 2. Registration accuracy test result
Fig. 8. The effects of rotation on the accuracy of four registration algorithms

4 Conclusion and Future Work

In view of current pork food safety concerns in our country and the shortcomings of existing domestic pork traceability technology, this paper proposed a pork traceability technique based on feature analysis with convolutional neural networks. We proposed an image registration method based on convolutional features: (i) a feature extraction method is constructed from the pre-trained VGG network; to use the deep convolutional architecture effectively in pork skin image registration, our feature descriptors exploit high-level convolutional information while preserving some local information. (ii) A non-rigid feature point registration method is proposed that gradually expands the inlier selection: the most reliable feature points quickly determine the coarse transformation in the first stages, and the feature points added later directly improve registration performance while reducing the impact of outliers. Compared with SIFT, feature pre-matching tests on pork skin image datasets show a considerable gain in precision, and image registration tests show that our approach is superior to common methods in most cases.

To our knowledge, no similar deep learning solution has yet been applied to pork traceability. The tracing success rate for local pig skin images still needs improvement: with the current method and the pig skin samples held, the registration success rate of local pig skin images is 86.67%, so a small number of skinned pork samples cannot be registered successfully. Experiments also show that the proposed algorithm cannot describe the features of pig skin at very low resolution. Moreover, this traceability technique applies only to skinned pork, while some pork sold on the market today has no skin. Future research can therefore explore features of the pork itself to achieve traceability for various types of pork; in the further development of the algorithm, we will consider features of the meat rather than the skin and attempt to trace unmarked pork.