
1 Introduction and Motivation

We introduce an efficient bio-inspired 3D object retrieval approach which can be implemented with very limited memory and processing power. Our motivation is to use ideas originating from the operation of the brain (viewer-centric object models with Markovian inference and information fusion) while avoiding the complexity of hierarchical deep neural networks, which would amount to a direct copy of nature’s successful mechanisms. In our research we focus on a relatively simple task: how to recognize/retrieve 3D objects from several 2D views taken from different directions.

Humans have access to only a limited subset of reality due to the limits of attentional capacity and sensitivity. As a result, our experiences do not replicate the real world but rather construct a representation of it through prediction and estimation. One example is the temporal difference (TD) learning algorithm, which received attention in neuroscience long ago [16]. TD mechanisms assume that subsequent predictions are correlated in some sense: TD learning adjusts predictions to match other predictions about the future. The evaluation of Markovian processes can be considered a rough approximation of this; moreover, Hidden Markov Models are able to make efficient predictions, accounting for the difference between the real world and its sensations with the help of probability functions. How we represent uncertainty greatly depends on the integration of data over time. The extent to which past events are used to represent uncertainty seems to vary over the cortex: primary visual cortex responds to rapid perturbations in the environment, while frontal cortices encode the longer-term contexts within which these perturbations occur [9].

Information fusion is also very important in the creation of the brain’s representations. Several examples of this are the different phenomena of vestibular and visual information co-processing. For example, during prolonged head rotations with the eyes closed, the elasticity of the cupula (a structure in the vestibular system providing the sense of spatial orientation) gradually restores it to its upright position. Thus the drive to the optokinetic response stops, misleadingly informing the brain that there is no motion. When opening the eyes in such situations, the world is seen moving and people feel giddy.

While the exact mechanisms of interaction between the different modalities are not always clear to researchers, we will fuse orientation and visual information in a viewer-centered object recognition model. In cognitive science, the recognition of objects from different views is described by two competing theories: according to the so-called object-centered approach [1], the structural description of simple parts plays the important role, without explicit object representations for the different views. This can be imagined as similar to computer vision algorithms for object recognition with SIFT-like features [12]. In contrast, the viewer-centered theory, supported e.g. by [7], suggests that recognition is done by matching specific views to a set of templates, which requires explicit viewer-specific object representations.

Convolutional Neural Networks have very strong biological motivation and have been extensively used for image-based recognition, detection, retrieval, and image segmentation. However, their complexity (including energy and memory requirements) is too large for real-time image-based recognition in embedded or mobile systems. What was known before, and has been shown experimentally in recent developments, is that simple approximations to input or internal data representations can still result in satisfactory performance. For example, the so-called XNOR-Networks, where both the filters and the input to convolutional layers are binary, perform convolutional operations 58\(\times \) faster and show 32\(\times \) memory savings [15].

Following the above bio-inspired concepts we introduce a retrieval model with the following features:

  • it is viewer-centered with the storage of a very limited number of 2D views,

  • it fuses visual and orientation information,

  • it utilizes the inference in temporal sequences of signals (Markovian),

  • the relation of observations and hidden model states can be estimated with simple correlations,

  • it relies on compact descriptors computed very fast,

  • it can be successfully used for real-time video object retrieval with lightweight devices.

There are two main reasons we are not using deep neural network models. First, we can implement the targeted concepts (Markovian inference, information fusion, viewer-centric object models) very efficiently in an HMM framework. Second, we have knowledge of efficient compact descriptors, and can use the orientation information directly in the Markovian model for temporal support (as explained later), i.e. there is no need for time-consuming training and optimization of millions of parameters of neural structures.

In the next section we give a short overview of related papers. Then the proposed object views (as hidden states of a Markov model), state transitions, observable features, and the decoding and retrieval steps are defined in consecutive subsections. The section Experiments and Evaluations contains experimental data and analysis, followed by the Summary.

2 A Brief Overview of Related Papers

Optical object retrieval and recognition is a very large topic with thousands of theoretical articles and applications; here we focus only on works closely related to our aims and motivations.

HMMs are often used in different recognition problems such as speech, musical sound, or human activity recognition, but they are relatively rarely met in the recognition of 2D or 3D visual objects. This is natural, since ordered sequences of features are needed to construct HMM models. In [10] affine invariant image features are built on the contours of objects, and the sequence of such features is fed to the HMM. This approach is interesting but apparently too unnatural to have found later followers.

In [5] the authors presented an approach for face recognition using Singular Value Decomposition (SVD) to extract relevant face features and HMMs as the classifier. In order to create the HMM model, the two-dimensional face images had to be transformed into one-dimensional observation sequences. For this purpose each face image was divided into overlapping blocks with the same width as the original image and a given height, and the singular values of these blocks were used as descriptors. A face image was divided into seven distinct horizontal regions (hair, forehead, eyebrows, eyes, nose, mouth and chin) forming seven hidden states in the Markov model. While the algorithm was tested on two standard databases, the advantage of the HMM model over other approaches was not discussed.

The method of Torralba et al. [17] seems closer to a real-life temporal sequence: an HMM was used for place recognition based on sequences of visual observations of the environment created by a low-resolution camera. It was also investigated how the visual context affects the recognition of objects in complex scenes. There is no doubt that this approach has real cognitive motivation and relevance compared to those above.

Gammeter et al. [8] used an accelerometer and a magnetic sensor to support the visual recognition of landmarks. Clustered SURF (Speeded Up Robust Features) features were quantized using a vocabulary of visual words, learnt by k-means clustering. For tracking objects the FAST corner detector was combined with sensor tracking. Because of the small storage capacity of the mobile device, a server-side service was used to store the information necessary for the algorithm.

It is obvious that video gives much more visual information about 3D objects than single 2D projections. Local features (like SIFT, FAST, etc.) are often used for view-centered recognition. In [14] the underlying topological structure of an image set was represented as a neighborhood graph of local features, and motion continuity in the query video was exploited for the recognition of 3D objects.

The viewer-centered, HMM-based 3D object retrieval method most similar to ours was published by Jain et al. [11]. However, there are many differences from our work and many ambiguous details in [11]: it is not clear how the crucial emission and transition probabilities were estimated, and the dimension of the applied image descriptor (13) seems too small for real-life applications. The dataset in their tests included only gray-scale CGI without texture, and no orientation sensor was used during recognition.

Our early work on utilizing orientation information for object retrieval can be found in [3]. Later we modified our method [4] to maximize a fitness function over a sequence of observations, based on the Hough transformation paradigm. While, as demonstrated by the examples above, the use of HMMs for object recognition is often somewhat unnatural, turning our previous Hough framework into an HMM is straightforward and also biologically motivated. As will be shown, our recent HMM model has a better hit-rate and smaller complexity, and it encapsulates the bio-inspired concepts described above.

3 Object Retrieval with HMM

To achieve object retrieval we need to build HMM models for all elements of the set of objects (M). Then, based on observations, we find the most probable state sequence for each object model. The state sequence among these which is most similar to the observation sequence will belong to the object being retrieved.

3.1 Object Views as States in a Markov Model

Let \(S = \{S_1,\ldots ,S_N\}\) denote the set of N hidden states of a model. At each step t the model is described as being in a state \(q_t\in S\), where \(t = 1,\ldots , T\).

In our approach the states can be considered as the 2D views (or the average of some neighboring views) of a given object model. This can easily be imagined as a camera targeting an object from a relative elevation and relative azimuth. The number of possible states should be kept low, otherwise the state transition matrix (\(\mathbf A \)) would contain too small numbers and finding the most probable state sequence would become unstable. On the other hand, a small number of states would mean that quite different views of some objects are represented by the same descriptors, resulting in decreased similarity between these model views and the actual test observations. Thus the generation of states should be designed carefully. Often Gaussian mixtures are used to combine the views of similar directions. Here we use a static subdivision of the circle of 360\(^\circ \) into 2, 4, 6, and 8 uniform parts of 180\(^\circ \), 90\(^\circ \), 60\(^\circ \), and 45\(^\circ \) respectively, with surprisingly good results as given in Sect. 4.

We define the initial state probabilities \(\mathbf {\pi }=\{\pi _i\}_{1\le i\le N}\) based on the orientation range of states:

$$\begin{aligned} \pi _i=P(q_1=S_i)=\frac{\alpha (S_i)}{360} \end{aligned}$$
(1)

where \(\alpha (S_i)\) is the size of the orientation aperture of state \(S_i\), given in degrees.
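As an illustration, a minimal Python sketch (with our own illustrative naming, not the code used in the experiments) of the uniform subdivision of the view circle and the initial probabilities of Eq. 1 could look as follows; for the uniform subdivisions used here, Eq. 1 simply reduces to \(\pi_i = 1/N\).

```python
import numpy as np

def make_states(n_states):
    """Uniformly subdivide the 360-degree view circle into n_states apertures.

    State S_i covers the half-open interval [starts[i], starts[i] + apertures[i]).
    """
    aperture = 360.0 / n_states
    starts = np.arange(n_states) * aperture
    apertures = np.full(n_states, aperture)
    return starts, apertures

def initial_probabilities(apertures):
    """Eq. (1): pi_i is proportional to the orientation aperture of state S_i."""
    return np.asarray(apertures) / 360.0

starts, apertures = make_states(4)      # four states of 90 degrees each
pi = initial_probabilities(apertures)   # -> [0.25, 0.25, 0.25, 0.25]
```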

3.2 State Transitions

Between two steps the model can undergo a change of state according to a set of transition probabilities associated with each pair of states. In general the transition probabilities are:

$$\begin{aligned} a_{ij}=P(q_t=S_j|q_{t-1}=S_i) \end{aligned}$$
(2)

where i and j indices refer to states of the HMM, \(a_{ij} \ge 0\), and for a given state \(\sum _{j=1}^{N} a_{ij} = 1\) holds. The transition probability matrix is denoted by \(\mathbf A =\{a_{ij}\}_{1\le i,j\le N}\).

To build a Markov model means learning its parameters (\(\mathbf {\pi }\), \(\mathbf A \), and the emission probabilities introduced later) by examining typical examples. However, our case is special: the probability of going from one state to another strongly depends on the user’s behavior and interest, and also on the frame rate of the camera. Thus we cannot follow the traditional way of using the Baum-Welch algorithm for parameter estimation based on several training samples, but can directly compute the transition probabilities based on geometric probability as follows.

First define \(\varDelta _{t-1,t}\) as the orientation difference between two successive observations:

$$\begin{aligned} \varDelta _{t-1,t}= \alpha (o_t)-\alpha (o_{t-1}). \end{aligned}$$
(3)

Now define \(R_i\) as the aperture interval belonging to state \(S_i\), given by its borderlines:

$$\begin{aligned} R_i = [ S_i^{min}, S_i^{max}[. \end{aligned}$$
(4)

The back projected aperture interval is the range of orientation from where the previous observation should originate:

$$\begin{aligned} L_j = [ S_j^{min}-\varDelta _{t-1,t}, S_j^{max}-\varDelta _{t-1,t}[. \end{aligned}$$
(5)

Now we can estimate the transition probability by the geometric probability concept applied to the intersection of \(L_j\) and \(R_i\):

$$\begin{aligned} a_{ij}=P(q_t=S_j|q_{t-1}=S_i)=\frac{\alpha (L_j\cap R_i)}{\alpha (L_j)}. \end{aligned}$$
(6)

Please see Fig. 1 for illustration.

Fig. 1. Geometrical interpretation of transition probabilities.
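The transition probabilities of Eq. 6 can thus be computed directly from the measured orientation difference. A hedged sketch (same illustrative naming as above, with a simple handling of the wrap-around at 360\(^\circ \)) could look like this:

```python
import numpy as np

def interval_overlap(a, b, c, d):
    """Length of the overlap of the half-open intervals [a, b) and [c, d)."""
    return max(0.0, min(b, d) - max(a, c))

def transition_matrix(starts, apertures, delta):
    """Eq. (6): a_ij = alpha(L_j intersect R_i) / alpha(L_j) for a measured
    orientation difference delta (in degrees) between successive observations.

    L_j is the aperture of state S_j shifted back by delta (Eq. 5); wrap-around
    on the view circle is handled by also testing copies shifted by +/-360.
    """
    n = len(starts)
    A = np.zeros((n, n))
    for j in range(n):
        lj_min = starts[j] - delta
        lj_max = lj_min + apertures[j]
        for i in range(n):
            ri_min, ri_max = starts[i], starts[i] + apertures[i]
            overlap = sum(interval_overlap(lj_min + k, lj_max + k, ri_min, ri_max)
                          for k in (-360.0, 0.0, 360.0))
            A[i, j] = overlap / apertures[j]  # A[i, j] = P(q_t = S_j | q_{t-1} = S_i)
    return A

# Example: four 90-degree states, camera rotated by 30 degrees between frames;
# from any state the model stays with probability 2/3 and advances with 1/3.
A = transition_matrix(np.array([0., 90., 180., 270.]), np.full(4, 90.), 30.0)
```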

3.3 Hidden States Approximated by Observations with Compact Descriptors

The appearance of objects may significantly differ from the appearance recorded during model generation under controlled circumstances. Changes in illumination, color balance, viewing angle, geometric distortion and image noise can result in heavily distorted feature descriptors. Thus observations only resemble the descriptors of the model states. Let \(O=\{o_1,o_2,\ldots ,o_T\}\) denote the observation sequence. The emission probability of a particular observation \({o}_t\) for state \(S_i\) is defined as

$$\begin{aligned} b_i(o_t)=P(o_t|q_t=S_i) \end{aligned}$$
(7)

In [4] we have shown that the CEDD (Color and Edge Directivity Descriptor) [2] is a robust low dimensional descriptor for object recognition. It is area based: pixels are classified into one of 6 texture classes (non-edge, vertical, horizontal, 45\(^\circ \) and 135\(^\circ \) diagonal, and non-directional edges). For each texture class a (normalized and quantized) 24 bin color histogram is generated, each bin representing colors obtained by the division of the HSV color space, resulting in feature vectors of dimension 144 (6\(\times \)24). The similarity of CEDD vectors is computed by the Tanimoto coefficient:

$$\begin{aligned} T(e_i,c_j)=\frac{e_i^T c_j}{e_i^T e_i+c_j^T c_j-e_i^T c_j} \end{aligned}$$
(8)

where \(e_i^T\) is the transpose of the query descriptor and \(c_j\) denotes the descriptors of the object views. Rotational invariance can be achieved as given in [3]. Now Eq. 7 can be rewritten as:

$$\begin{aligned} b_i(o_t)=\frac{T(C(S_i),C(o_t))}{\sum _{j=1}^{N} T(C(S_j),C(o_t))} \end{aligned}$$
(9)

where C stands for the descriptor generating function of CEDD. Since each model state can cover a large directional range, we use the average CEDD vector of the available model samples within that range to represent the whole state with a single descriptor.
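Assuming the 144-dimensional CEDD vectors are already available (their extraction is not shown here), the Tanimoto coefficient of Eq. 8 and the normalized emission probabilities of Eq. 9 are straightforward to compute; a small sketch:

```python
import numpy as np

def tanimoto(e, c):
    """Eq. (8): Tanimoto coefficient of two descriptor vectors."""
    e, c = np.asarray(e, dtype=float), np.asarray(c, dtype=float)
    return float(e @ c / (e @ e + c @ c - e @ c))

def emission_probabilities(obs_descriptor, state_descriptors):
    """Eq. (9): similarity of the observation's descriptor to every state's
    average descriptor, normalized over the N states of one object model."""
    sims = np.array([tanimoto(s, obs_descriptor) for s in state_descriptors])
    return sims / sims.sum()

# b = emission_probabilities(cedd_of_query_frame, avg_cedd_of_states)
# (both arguments are assumed to be precomputed CEDD vectors)
```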

Now we have the complete set of parameters of all HMMs, denoted by \(\mathbf {\lambda }_k = (\mathbf A ,b,\mathbf {\pi })\), \(k\in M\). The task is to find the most probable state sequence \(\hat{S}_k\) for all possible candidate objects, based on the observations.

3.4 Decoding for Retrieval

We use the well-known Viterbi algorithm to get the state sequence with the maximum likelihood. The variable \(\delta _t(i)\) gives the highest probability of producing the observation sequence \(o_1,o_2,\ldots ,o_t\) when moving along a hidden state sequence \(q_1,q_2,\ldots ,q_{t-1}\) and getting into \(q_t = S_i\), i.e.

$$\begin{aligned} \delta _t(i)=\max P(q_1,q_2,\ldots ,q_t=S_i,o_1,o_2,\ldots ,o_t|\lambda ) \end{aligned}$$
(10)

It can be calculated inductively as

  1. Initialization:

    $$\begin{aligned} \delta _1(i)=\pi _ib_i(o_1), \quad 1\le i\le N \end{aligned}$$
    (11)
  2. Recursion:

    $$\begin{aligned} \delta _{t+1}(j)= b_j(o_{t+1}) \max _i[a_{ij}\delta _{t}(i)], \quad 1\le j\le N \end{aligned}$$
    (12)

Finally, we can choose the most probable state \(\hat{i}\) ending at T:

$$\begin{aligned} \hat{i}=\arg \max _i[\delta _T(i)] \end{aligned}$$
(13)

To achieve object retrieval we find the most probable state sequence \(\hat{S}_k\) with the above steps for all possible candidate objects. Then, to select the winning object, we compare the observations with the most probable state sequences:

$$\begin{aligned} \hat{k} = \arg \max _{\forall k\in M}\left( \frac{\sum _{i=1}^{N} T(C(o_i),C(\hat{S}_{k,i}))}{N}\right) \end{aligned}$$
(14)
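A compact sketch of the decoding and winner-selection steps (Eqs. 11–14) is given below; it reuses the helper functions from the previous sketches and is therefore only illustrative. Note that in our setting the transition matrix is recomputed at each step from the measured orientation difference, so the sketch takes a list of per-step matrices.

```python
import numpy as np

def viterbi(pi, As, B):
    """Eqs. (11)-(13): most probable state sequence for one object model.

    pi : initial probabilities (length N), B[t, i] = b_i(o_t) from Eq. (9),
    As[t-1] : transition matrix built from the orientation difference between
    observations t-1 and t (Eq. 6). Returns the decoded state indices (length T).
    """
    T, N = B.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * As[t - 1]   # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = B[t] * scores.max(axis=0)
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()                    # Eq. (13)
    for t in range(T - 2, -1, -1):                   # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path

def retrieve(models, obs_descriptors, per_step_As):
    """Eq. (14): decode each candidate model and keep the one whose decoded
    state sequence matches the query observations best on average (Tanimoto).
    `models` maps an object id to {'pi': ..., 'state_descriptors': ...};
    tanimoto() and emission_probabilities() are the sketches given earlier."""
    best_k, best_score = None, -np.inf
    for k, m in models.items():
        B = np.array([emission_probabilities(o, m['state_descriptors'])
                      for o in obs_descriptors])
        path = viterbi(m['pi'], per_step_As, B)
        score = np.mean([tanimoto(o, m['state_descriptors'][s])
                         for o, s in zip(obs_descriptors, path)])
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```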

4 Experiments and Evaluations

4.1 Test Dataset

The COIL-100 dataset [13] includes 100 different objects; 72 images of each object were taken at pose intervals of 5\(^\circ \). We evaluated retrieval with clear and heavily distorted queries using Gaussian noise and motion blur. The imnoise function of Matlab, with standard deviation \(sd=0.012\), was used to generate additive Gaussian noise (GN), while motion blur (MB) was generated by fspecial with parameters \(len=15\) and angle \(\theta =20^\circ \). Some example queries are shown in Fig. 2.

For different tests different numbers (2, 4, 6, 8) of hidden states were generated by equally dividing the full circle. Each state was represented with its average CEDD descriptor vector.

To estimate the relative orientation of the camera we used the same built-in IMU (Inertial Measurement Unit) sensor as in [4], with around 4.5\(^\circ \) average absolute error and 5.25\(^\circ \) variance. The evaluation of our method with textured and varying backgrounds is left for future work.

Fig. 2. First three rows: clear samples from COIL-100. \(4^{th}\) row: example queries loaded with Gaussian additive noise. \(5^{th}\) row: example queries loaded with motion blur.

Fig. 3. First graph: average hit-rate with clear samples from COIL-100. Second graph: queries loaded with Gaussian additive noise. Third graph: queries loaded with motion blur.

4.2 Hit-Rate

The hit-rate of retrieval is measured as the average of 10 experiments over all 100 objects with randomly generated queries (the orientation angles of subsequent queries were increased monotonically). As shown in Fig. 3 for queries of different quality, the hit-rate increases monotonically with the number of queries. It is also true that a higher number of states gives better results. We tested no more than 8 states, where performance reached its maximum in most cases.

For comparison with the method of [4] we included the best results of the Voting Candidates algorithm, denoted by VCI. An obvious 2–6% gain over VCI is observable. Please note that the same visual descriptors and orientation sensor were used by VCI in the previous tests.

4.3 Running Time and Memory Requirements

Tests were run on a Samsung SM-T311 tablet equipped with Android 4.2.2 Jelly Bean, 1 GB RAM, and an ARM Cortex A9 Dual-Core 1.5 GHz processor. No code optimization or parallelism was carried out and only the CPU was used during calculations. As given in Table 1, even for 8 queries the whole processing chain runs within 1 second on the specified mobile computing hardware. This is a fraction of the complexity of VCI [4].

The advantage of using compact descriptors is the very limited memory requirement of the object models. A CEDD descriptor occupies 144 bytes in memory and the orientation can be stored in 4 bytes. For 100 objects and 8 states we need to store roughly 120 KB (100 \(\times \) 8 \(\times \) 148 bytes).

Table 1. Running times in seconds for the retrieval of one object from 100.

5 Summary

The main purpose and contribution of our paper is twofold:

  • building a bio-inspired object retrieval framework with Markovian inference and multimodal information fusion in a viewer-centric model, and

  • showing that its implementation is robust and resource efficient enough to be used on mobile devices.

We presented our first results on a dataset of 100 3D objects with 7200 views, using clear and noisy queries. While the results are better than with our previous model, there is still a lot to do: we are developing a clustering technique to build optimal states instead of uniformly distributed ones, and we should also work on automatic object segmentation and tracking.