
1 Introduction

Object recognition refers to obtaining environmental information through a set of sensors and identifying specific objects in the scene through computer analysis. Its task is to recognize specific objects in a scene and report their position and pose. The general pipeline includes object detection, feature extraction, and recognition. When humans perceive external information, about 80% of it comes from vision. Vision-based object recognition has therefore become a popular research topic in recent years and is widely used in fields such as robot navigation, industrial inspection, aerospace, and military reconnaissance.

As the complexity of the objects to be identified increases, traditional 2D image-based object recognition can no longer meet practical requirements, whereas 3D object recognition can objectively describe shape and structure and improve the recognition rate. According to the current state of research worldwide, existing 3D object recognition methods fall roughly into five categories: geometric or model-based, appearance or view-based, feature matching-based, depth image-based, and intelligent algorithm-based methods. These methods are introduced and compared in turn below.

2 Geometric or Model-Based Method

Object recognition that exploits prior knowledge of the shape and structure of the object is generally called geometric or model-based 3D object recognition [1]. The method extracts a 3D geometric feature description from the input data and matches it against a model description to recognize and localize the object.

Qian [2] proposed a new 3D object recognition method. The method segments a 3D point set into a number of planar patches and extracts the Inter-Plane Relationships (IPRs) for all patches. Based on the IPRs, the High Level Feature (HLF) for each patch is determined. A Gaussian-Mixture-Model-based plane classifier is then employed to classify each patch into one belonging to a certain model object. Finally, a recursive plane clustering procedure is performed to cluster the classified planes into the model objects.

Lin et al. [3] described object contours with a composite geometric component model and established an ordered chain structure to measure the degree of matching. The method can solve the matching problem for complex object contours, and its detection time is 60%–90% lower than that of previous methods; it works best for rigid objects with clear outlines. Ding [4] used a forward method to build a 3D scattering center model offline from the object's CAD model; the model can effectively predict the object in arbitrary poses.

This class of methods is generally applicable to objects with regular shapes, and the shape comparison is intuitive and easy to understand. However, the algorithms are computationally expensive and require a geometric model to be built, which makes them unsuitable for environments with complex backgrounds and noise interference. Occlusion between objects also degrades recognition results.

3 Appearance or View-Based Method

3.1 Single-View Feature-Based Method

This method analyzes the image of the object observed from a certain viewpoint and identifies the object through feature extraction and feature matching. It requires that the object's pose be relatively stable and its structure relatively simple.

Eigen et al. [5] used a multi-scale DNN to obtain depth information from single-view images; the method improves only across scales and has limitations for other 3D geometric information. Lee et al. [6] proposed an automatic pose estimation method that recovers depth values from a given single image, suitable for various image sequences containing objects with different appearances and poses. Yan et al. [7] used point, line and surface information in the image to rectify the input image and eliminate distortion; however, the approach is mostly applied to symmetric building scenes, and its robustness needs further improvement.

In such recognition systems, higher-dimensional features are generally required to represent the object, and the feature vector is compared with template feature vectors to complete recognition. Single-view acquisition is susceptible to factors such as viewing angle, lighting, and complex backgrounds.

3.2 Multi-view Feature-Based Method

Building on single-view object recognition, multi-view methods compensate for misidentification caused by different objects forming similar 2D images and by background occlusion. Feature matching across images from two different viewing angles enables camera calibration and recovery of the 3D coordinates of spatial points; from this basis, structure from motion (SFM) [8], three-view [9] and multi-view [10] methods were subsequently developed.

Chen [11] extracted features and reduced their dimensionality, then fed them into an SVM [12] for classification and identification, alleviating the classification complexity and low recognition efficiency caused by growing feature dimensionality. Zhan [13] extracted multiple features and used PCA to eliminate redundant information between them; finally, an SVM optimized by a genetic algorithm performs classification and recognition, improving both the accuracy and speed of 3D object recognition.

To identify an object objectively and accurately, a larger number of views is usually required, which significantly increases classification complexity; with fewer views, recognition accuracy drops.

3.3 Optical Operation-Based Method

The basic principle is to obtain 2D graphics or images by optical imaging and to identify the object from optical characteristic parameters [14]. During recognition, the similarity between the object and a template is measured, and a set of related features determines the object's category, position and pose.

A classical optical operation is the optical flow method: the motion of the illuminated object is projected onto the image plane as changes in light intensity, and the optical flow field is formed from the pixel variations in the sensor's discrete samples. The method is accurate and can adapt to motion, but its heavy computation and sensitivity to the environment limit its applications.

Zhang [15] encoded the depth information of a 3D object into a 2D image and applied optical 2D image recognition technology to identify the object; however, the method is limited to simple spherical objects, and its impact in practical applications has not been evaluated. Vallmitjana [16] designed different filters for different views of the object and integrated all the data into an object-centered coordinate system; however, heavy use of filters tends to introduce noise limitations in practice.

Optical operations can process information in parallel and recognize quickly, but the overall computation remains large and time-consuming. It is therefore necessary to extract 2D information from the 3D object and, on that basis, realize optical 3D object recognition.

4 Feature Matching-Based Method

4.1 Global Feature-Based Method

The traditional image description method selects features that can represent the object as a whole, such as color and texture, from a large number of images containing the object, and applies statistical classification to recognize it. The color histogram [17] describes the proportion of each color in the entire image but does not capture the specific distribution or spatial position of colors. Texture features [18] describe a surface property while ignoring the object's other properties, and are quite limited for acquiring high-level image content.
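As a concrete illustration of a global feature, the sketch below computes a normalized joint RGB color histogram with NumPy. It is a generic example rather than the exact formulation of [17]; the bin count and the toy two-color image are arbitrary choices.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize each RGB channel into `bins` levels and count joint-color
    occurrences over all pixels, normalized to sum to 1."""
    quantized = (image // (256 // bins)).reshape(-1, 3).astype(int)
    # Map each (r, g, b) triple to a single histogram bin index.
    idx = (quantized[:, 0] * bins + quantized[:, 1]) * bins + quantized[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

# Toy 4x4 "image": left half pure red, right half pure blue.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :2, 0] = 255
img[:, 2:, 2] = 255
h = color_histogram(img)  # two bins each hold half the pixels
```

Note that the histogram records only color proportions: any spatial rearrangement of the same pixels yields an identical descriptor, which is exactly the weakness described above.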

The selected features are broadly representative, computationally cheap and easy to implement, but they resolve detail poorly, are sensitive to occlusion and background, and require the object to be isolated with complete data, so the range of application is limited [19]. Global features may have the following three shortcomings:

(1) Under a complex image structure, the quality of image segmentation affects object recognition;

(2) The amount of learning data is large and the training time is long;

(3) When the object undergoes a large deformation, the global feature changes abruptly.

The model-based and view-based methods mentioned above show disadvantages in this respect.

4.2 Local Feature-Based Method

A local feature is a set of attributes that objectively and stably describe the object; local features are combined into a feature vector, giving an effective representation of the object. Algorithms based on local feature matching have achieved good results in the field of object recognition [20,21,22].

The selected feature points must satisfy the following conditions [23]: (1) they can be extracted repeatably; (2) they define a unique 3D coordinate system; (3) their neighborhood contains valid descriptive information. Feature point matching can then be performed once selection is complete [24]. The most widely used descriptors and detectors are SIFT [25], SURF [26], the Harris detector [27], the Hessian detector [28], HOG [29] and LBP [30].
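To make the descriptor idea concrete, the following is a minimal NumPy sketch of the basic 3x3 LBP operator (the original 8-neighbor formulation, not the extended circular variant): each pixel's neighbors are thresholded against the center and read off as an 8-bit code. The toy image is illustrative only.

```python
import numpy as np

def lbp_8neighbor(gray):
    """Basic 3x3 LBP: threshold the 8 neighbors of each interior pixel
    against the center and pack the results into an 8-bit code."""
    g = gray.astype(int)
    c = g[1:-1, 1:-1]                      # interior (center) pixels
    # Neighbor offsets read clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (nb >= c).astype(int) << bit
    return code

# Toy image: a dark center surrounded by brighter neighbors,
# so every comparison bit is set and the code is 255.
img = np.array([[10, 10, 10],
                [10,  5, 10],
                [10, 10, 10]])
codes = lbp_8neighbor(img)
```

In practice the codes over a region are accumulated into a histogram, which serves as the local texture descriptor.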

Wei et al. [31] extracted a feature description from invariant-angle contours, obtained the object's feature vector by invariant moment transformation, and matched features by comparing the cosine of the angle between vectors. The method can be used for object recognition in complex scenes.

Although existing local feature-based techniques achieve high accuracy and can handle occlusion and clutter, they still have high computational complexity. To address this, [32] proposed a keypoints-based surface representation (KSR) that does not compute local feature descriptors; instead it uses the geometric relationships between detected 3D key points to represent the local surface, which suppresses noise to some extent.

Local features are stable and not easily affected by environmental factors; even with very large amounts of data, fast registration can be achieved, though at the cost of algorithmic complexity and extra computation. Global features are invariant, cheap to compute and easy to understand. The two can therefore be combined to improve the recognition rate while keeping computation low.

5 Depth Image-Based Method

In the narrow sense, a depth image [33] is obtained by acquiring the depth information of an object with a depth sensor such as microwave or laser. At present, the methods most frequently used to obtain depth images are stereoscopic vision [34], microwave ranging, and lidar imaging [35].

The more commonly used depth image types are grid representation [36] and point cloud representation [37].

5.1 Grid Representation

The grid consists of points, edges and planes. It is an irregular data structure and has a rich description of the shape and other details.

Fang et al. [38] introduced the grid structure into multi-view imagery, so that grid point positions correspond to viewpoint image feature vectors, and built the model from local invariant feature statistics of the object. Wang et al. [39] proposed an end-to-end deep learning framework that generates a 3D mesh directly from a single color image: a CNN represents the 3D mesh, and features extracted from the input image produce the correct geometry.

Grid data is informative and carries topology. However, when rendering a large scene, grid reconstruction brings problems such as long computation time and large storage requirements.

5.2 Point Cloud Representation

The point cloud is a set of 3D point coordinates of a scene or an object. Due to the huge amount of point cloud scene data, each object contains a large number of features, and each feature corresponds to a high-dimensional description vector, resulting in large computational complexity and low computational efficiency [40].

The PointNet network [41] can process unordered and rotated point cloud data. Building on it, PointNet++ [42] adds a hierarchical structure to the network to process local features. The SO-Net [43] architecture simulates the spatial distribution of the point cloud with a self-organizing map (SOM). On ModelNet40 classification, PointNet achieved 86.2%, PointNet++ is markedly stronger than PointNet, and SO-Net reached up to 90.8%.
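One concrete building block of such hierarchical pipelines is farthest point sampling, which PointNet++ uses to pick well-spread centroids for its local regions. Below is a minimal NumPy sketch of the idea (not the authors' implementation); the cloud size and sample count are arbitrary.

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Iteratively pick the point farthest from the set already chosen,
    producing k indices that cover the cloud evenly."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [int(rng.integers(n))]        # random starting point
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))         # farthest from current set
        chosen.append(nxt)
        # Keep, for every point, its distance to the nearest chosen point.
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

# 100 random 3D points, downsampled to 8 well-spread centroids.
cloud = np.random.default_rng(1).random((100, 3))
idx = farthest_point_sampling(cloud, 8)
```

Each chosen point has distance zero to the set, so the argmax never re-selects it; the result is a set of mutually distant centroids.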

Graph-based methods are a novel direction for 3D point cloud object recognition. Wang et al. [44] proposed the EdgeConv module in DGCNN; by stacking or reusing EdgeConv modules, global shape information can be extracted. DGCNN improves performance by 0.5% over PointNet++. The key to RS-CNN [45] is learning from relations, i.e., the geometric topology constraints among points. RS-CNN reduces the error rate of PointNet++ by 31.2% and is more robust than PointNet, PointNet++ and DGCNN.

6 Intelligent Algorithm-Based Method

Intelligent algorithms are engineering algorithms realized on computers that simulate and reproduce biological systems, human intelligence, and physical or chemical processes. They are widely used in object recognition and image matching. Several major intelligent algorithms are introduced below:

6.1 Ant Colony Algorithm

Based on the foraging behavior of ant colonies, a population-based simulated evolutionary algorithm called Ant Colony Optimization (ACO) was proposed [46].

The idea is that, while foraging, ants exchange and transmit information and choose their next path according to the lengths of the paths already taken, exhibiting positive feedback [47]. When an ant can no longer move, the path taken so far corresponds to a feasible solution of the optimization problem. Zhang et al. [48] used the relative difference between the gradient value and the statistical mean as the ants' search criterion for image edge detection. In the future, parallel ACO algorithms could further reduce the computational complexity.
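A minimal illustration of this positive-feedback mechanism is classic ACO on a tiny traveling salesman instance: ants build tours guided by pheromone and inverse distance, pheromone evaporates, and short tours get reinforced. The sketch below is generic ACO, not the edge-detection variant of [48], and all parameter values are arbitrary.

```python
import numpy as np

def aco_tsp(dist, n_ants=20, n_iters=50, alpha=1.0, beta=2.0,
            rho=0.5, q=1.0, seed=0):
    """Toy Ant Colony Optimization for TSP: alpha weights pheromone,
    beta weights heuristic visibility (1/distance), rho is evaporation."""
    rng = np.random.default_rng(seed)
    n = len(dist)
    tau = np.ones((n, n))                  # pheromone trails
    eta = 1.0 / (dist + np.eye(n))         # visibility (diag masked later)
    best_len, best_tour = np.inf, None
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            tour = [int(rng.integers(n))]
            while len(tour) < n:
                i = tour[-1]
                mask = np.ones(n, bool)
                mask[tour] = False         # forbid visited cities
                w = (tau[i] ** alpha) * (eta[i] ** beta) * mask
                tour.append(int(rng.choice(n, p=w / w.sum())))
            length = sum(dist[tour[k], tour[(k + 1) % n]] for k in range(n))
            tours.append((length, tour))
            if length < best_len:
                best_len, best_tour = length, tour
        tau *= 1 - rho                     # evaporation (negative feedback)
        for length, tour in tours:         # reinforcement (positive feedback)
            for k in range(n):
                tau[tour[k], tour[(k + 1) % n]] += q / length
    return best_len, best_tour

# Four cities on a unit square; the optimal tour is the perimeter, length 4.
pts = np.array([[0, 0], [0, 1], [1, 1], [1, 0]], float)
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
length, tour = aco_tsp(d)
```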

6.2 Particle Swarm Optimization

Based on the foraging behavior of bird flocks, Kennedy and Eberhart proposed Particle Swarm Optimization (PSO) [49]. The flock is treated as a group of random particles, each with a direction and a velocity; with the solution nearest the food and the best solution found so far by the whole population as references, the region closest to the food is taken as the best solution to the problem.
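The update rule just described can be sketched as follows: each particle tracks its personal best, the swarm tracks a global best, and velocities blend inertia with attraction toward both. This is the standard PSO formulation with illustrative parameter values, not the DNSPSO variant discussed below.

```python
import numpy as np

def pso(f, dim, n_particles=30, n_iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO minimizing f: w is inertia, c1/c2 weight attraction
    to the personal best and the global best respectively."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))
    v = np.zeros_like(x)
    pbest, pbest_val = x.copy(), np.array([f(p) for p in x])
    g = pbest[np.argmin(pbest_val)].copy()
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        g = pbest[np.argmin(pbest_val)].copy()
    return g, f(g)

# Minimize the sphere function; the optimum is 0 at the origin.
best, best_val = pso(lambda p: float(np.sum(p ** 2)), dim=3)
```

With w = 0.7 and c1 + c2 = 3.0 the swarm stays in the stable regime and contracts onto the optimum.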

Due to rapid loss of diversity, PSO suffers from premature convergence. To improve its performance, Wang et al. [50] proposed a hybrid PSO algorithm (DNSPSO) that combines a diversity enhancement mechanism with neighborhood search strategies. By combining these two strategies, DNSPSO achieves a trade-off between exploration and exploitation. Compared with standard PSO, DNSPSO does not increase computation time and gives better results on low-dimensional problems.

6.3 Artificial Fish-Swarm Algorithm

The Artificial Fish-Swarm Algorithm [51] derives from the movement characteristics of fish. In a body of water, fish gather according to their foraging behavior, so the place where the most fish gather has the best nutritional water quality; that place is taken as the best solution to the problem.

To address the computational complexity of the artificial fish-swarm algorithm and its slow convergence in later stages, Ma et al. [52] proposed an adaptive vision-based artificial fish-swarm algorithm (AVAFSA), which gradually shrinks the fish's field of view as the algorithm iterates and stops iterating once the field of view falls below half of its initial value. The improved algorithm converges quickly with little computation, and is more accurate and stable than the basic AFSA.

6.4 Genetic Algorithm

The Genetic Algorithm (GA) [53] is an evolutionary algorithm that exploits the natural laws of the biological world. The parameters of the optimization problem are regarded as chromosomes; the population's chromosomes are iteratively improved by selection, crossover and mutation, and the chromosomes that meet the optimization objective are the feasible solutions.
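The selection-crossover-mutation loop can be sketched as a minimal binary GA. This is a generic textbook version with arbitrary parameter choices, shown on the toy OneMax problem (fitness = number of 1-bits), not the immune or SVM-hybrid variants cited below.

```python
import numpy as np

def genetic_algorithm(fitness, n_bits=16, pop_size=40, n_gens=60,
                      p_cross=0.9, p_mut=0.02, seed=0):
    """Minimal binary GA: tournament selection, one-point crossover,
    bit-flip mutation; maximizes `fitness` over bit strings."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, (pop_size, n_bits))

    def select(scores):
        # Binary tournament: pick two at random, keep the fitter.
        i, j = rng.integers(pop_size, size=2)
        return pop[i] if scores[i] >= scores[j] else pop[j]

    for _ in range(n_gens):
        scores = np.array([fitness(ind) for ind in pop])
        children = []
        while len(children) < pop_size:
            p1, p2 = select(scores).copy(), select(scores).copy()
            if rng.random() < p_cross:          # one-point crossover
                cut = int(rng.integers(1, n_bits))
                p1[cut:], p2[cut:] = p2[cut:].copy(), p1[cut:].copy()
            for child in (p1, p2):              # bit-flip mutation
                flips = rng.random(n_bits) < p_mut
                child[flips] ^= 1
                children.append(child)
        pop = np.array(children[:pop_size])

    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(scores)], int(scores.max())

# OneMax toy problem: the optimum is the all-ones string, score 16.
best, best_score = genetic_algorithm(lambda ind: int(ind.sum()))
```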

To remedy GA's defects, the immune genetic algorithm combines an immune algorithm [54] with GA, solving GA's premature convergence problem and preserving the diversity of the population [55]. Tao et al. [56] combined GA with an SVM to classify data, and the classification accuracy was greatly improved.

6.5 Simulated Annealing Optimization

Simulated Annealing [57] mimics the physical process of heating and slowly cooling solid matter, applied to the solution of general optimization problems. Shieh et al. [58] proposed a hybrid algorithm combining particle swarm optimization with simulated annealing (SA-PSO), which inherits the solution quality of simulated annealing and the fast search capability of particle swarms, increasing efficiency and speeding up convergence.
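The heating-and-cooling analogy translates into a simple accept/reject loop: propose a random perturbation, always accept improvements, and accept worse moves with probability exp(-delta/T) while the temperature T cools. The sketch below is a generic simulated annealing routine (not SA-PSO); the test function, cooling rate and step size are arbitrary choices.

```python
import numpy as np

def simulated_annealing(f, x0, t0=1.0, cooling=0.95, n_iters=500,
                        step=0.5, seed=0):
    """Minimal SA minimizing f: Gaussian proposals, Metropolis acceptance,
    geometric cooling schedule."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, float)
    fx, t = f(x), t0
    best, best_val = x.copy(), fx
    for _ in range(n_iters):
        cand = x + rng.normal(0, step, x.shape)
        fc = f(cand)
        # Accept downhill moves always; uphill moves with prob exp(-delta/T).
        if fc < fx or rng.random() < np.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < best_val:
                best, best_val = x.copy(), fx
        t *= cooling                      # cool down
    return best, best_val

# Double-well function (x^2 - 1)^2 with global minima at x = +/-1;
# starting far away at x = 3, SA walks into one of the basins.
best, best_val = simulated_annealing(lambda v: float((v[0] ** 2 - 1.0) ** 2),
                                     [3.0])
```

At high temperature the loop explores freely (escaping local minima); as T shrinks it degenerates into pure hill descent.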

Simulated annealing can also be improved by combining it with other algorithms, such as PSO [59], GA [60] and the ant colony algorithm [61].

6.6 Neural Networks

A neural network [62] is a mathematical model that imitates how humans learn regularities in natural phenomena; it solves problems through its working principle and adapts to different information by adjusting the connections between internal nodes.

The advantage of neural networks is self-learning: the learning rules are simple and easy to implement on a computer, and the application prospects are broad. The disadvantage is that they cannot explain their own reasoning process or basis, and once data is insufficient they lose the ability to work properly.

Intelligent algorithms are an emerging research direction, and Table 1 lists the comparison of these six algorithms.

Table 1. Comparison of intelligent algorithms

7 Conclusion

Vision-based 3D object recognition has always been a research hotspot in computer vision. As discussed above, methods (1) and (2) both allow intuitive shape comparison; when no shape description of the object is available, method (2) can be used. However, both require the object to be isolated and its data complete, and they are sensitive to occlusion and background, so their scope of application is limited. In contrast, method (3) is more robust to overlap and complex backgrounds and has become the most common approach. Method (4) captures the spatial shape contour of the object, offering advantages that an ordinary CCD camera does not and changing the paradigm of 2D image recognition. Method (5) differs in that it applies optimization strategies in combination with the first four methods for improvement.

At present, the most widely used recognition methods target uniformly distributed point clouds or scenes with few objects. 3D point cloud scene data is sensitive to noise, and its density distribution can be uneven. How to reduce point cloud noise, mitigate the impact of uneven density, and transfer mature 2D object recognition techniques to 3D point cloud data will be important research directions. Table 2 lists the comparison of the various recognition algorithms.

Table 2. Comparison of various recognition algorithms