Introduction

A vast amount of visual data enters the human eyes every second, far more than the brain can process in real time. The human visual system therefore automatically extracts important information from the flood of visual input for further processing in the cerebral cortex, a process known as selective attention. Attention shift, a selective attention behavior, has become a research hotspot. A scanpath consists of fixations and gaze shifts (saccades). Fixations are gaze locations, which indicate the attractive regions in the scene; scanpaths are the sequences of these locations, which reflect the order in which the attractive regions are visited. Research on shifts of visual attention focuses on the regions of a scene that draw people's attention and uncovers how people select these regions.

Within the past two decades, various computational models of visual attention have been proposed. Among them, space-based models (Borji and Itti 2013) focus on the density and the sequence of fixations. This research has helped to explain the biopsychological mechanisms of human visual attention and has contributed to modern applications such as Web site advertisement design and robot vision.

Based on their object of research, space-based models can be divided into two kinds: fixation density prediction models and saccadic scanpath estimation models.

Fixation density prediction models calculate image saliency statically. Based on the feature-integration theory of attention (Treisman and Gelade 1980), Itti et al. (1998) first proposed a biologically plausible computational saliency model, which simulates the working process of the early visual areas. Motivated by Itti's model, a graph-based visual saliency (GBVS) model was proposed (Harel et al. 2007), in which saliency is defined as the stationary distribution of a fully connected graph. However, both of these classical models have very high computational complexity. Therefore, a faster model using the spectral residual was put forward (Hou and Zhang 2007), suggesting that the information of image saliency is mainly contained in the residual amplitude spectrum. Many other researchers, however, emphasize the importance of the phase spectrum. Guo et al. (2008) proposed a model using the phase spectrum of the quaternion Fourier transform (PQFT), which is even faster and more accurate than Hou and Zhang's method in local saliency detection.

Saccadic scanpath estimation models can predict both the density and the order of fixations. Yet research on gaze shifts has not been systematically and fully developed so far. The extension of Itti's static model (Saliency Toolbox, STB, Walther and Koch 2006) uses the winner-takes-all and inhibition of return (IoR) mechanisms to simulate scanpaths: the next fixation point is chosen as the maximal value in the saliency map with the current fixation region inhibited. The whole scanpath is computed from one static saliency map, disregarding the changes of saliency caused by gaze shifts. Another model based on Itti's (Da Silva and Courboulay 2013) uses a predator–prey method to simulate the dynamic competition between different stimuli, with a feedback criterion added for generating fixations in scene exploration; however, this model depends on too many parameters. A method based on the principle of information maximization was proposed (Wang et al. 2011), which applies independent component analysis (ICA) filters and selects the next fixation by the maximal residual information between reference and filter response. However, its ICA filter responses require large numbers of natural image patches for model training, increasing the computational burden. In a recent paper (Engbert et al. 2015), a simple saccadic saliency strategy was presented that reduces computational time by using grids to simplify the image; real eye tracking data are used to verify the statistical structure of eye movements. Strictly speaking, it is a saccadic strategy built on existing saliency maps. A saccadic model for the free-viewing condition and its extension (Le Meur and Liu 2015; Le Meur and Coutrot 2016) were also proposed; both emphasize the oculomotor biases of real eye tracking data. Other more complex models have been reported, including one that uses ICA to find the location of the maximal super-Gaussian component, measured by the kurtosis function, as the next fixation (Sun et al. 2014), and another that uses a trained hidden Markov model to obtain the next fixation (Liu et al. 2013).

Existing methods are limited to using only one static saliency map and yield unsatisfactory accuracy. To tackle these problems, we propose an effective bio-inspired scanpath estimation method that gives sufficient consideration to the dynamic influence of the fovea on saliency. The bias of gaze shifts (Le Meur and Liu 2015) and the mechanism of IoR in short-term memory are also considered in the proposed model. By integrating these factors, a probability map of gaze shifts is acquired to generate candidates for the next fixation. Besides some traditional similarity criteria, several novel and objective criteria for comparing fixation sequences are introduced in the experiments. Experimental results show that under these measures, our method performs more accurately on several datasets than many existing models do.

Related works

Researchers have investigated the biases of saccades for a long time and have found many characteristics of saccades (Tatler and Vincent 2008; Bays and Husain 2012). Among these studies, Le Meur and Liu (2015) and Le Meur and Coutrot (2016) conducted relatively complete experiments. The former model (Le Meur and Liu 2015) contributes a method for computing the joint distribution of the distance and direction of gaze shifts, using the kernel density estimation (KDE) toolbox (Botev et al. 2010). Let the current fixation be \(\varvec{x}_{t}\) and the next one \(\varvec{x}_{t + 1}\). The norm of the saccadic vector \(\varvec{x}_{t + 1} - \varvec{x}_{t}\) is defined as the shift distance (visual angle) \(d\), and its angle from the abscissa is defined as the shift angle \(\phi\). The estimated distribution is represented as

$$p_{\text{KDE}} (d,\phi ) = \frac{1}{n}\sum\limits_{i = 1}^{n} {K_{h} (d - d_{i} ,\phi - \phi_{i} )} ,$$
(1)

where \(n\) is the total number of samples \((d_{i} ,\phi_{i} )\), and \(K_{h}\) is a Gaussian kernel. Besides the distribution of gaze shifts, this model also adopts saliency maps computed by the GBVS model (Harel et al. 2007) and the IoR mechanism, both of which have been used in other models.
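For illustration, the following Python sketch shows how such a two-dimensional density over (distance, angle) pairs could be fitted. The original work uses the MATLAB KDE toolbox of Botev et al. (2010); here scipy.stats.gaussian_kde is substituted, so the bandwidth selection differs, and the training samples are synthetic placeholders rather than real eye tracking data.

```python
# A minimal sketch of Eq. (1): fit a 2-D Gaussian-kernel density over
# observed (distance, angle) saccade samples.
import numpy as np
from scipy.stats import gaussian_kde

# Placeholder training data standing in for saccades extracted from a dataset.
rng = np.random.default_rng(0)
distances = rng.gamma(shape=2.0, scale=2.0, size=1000)  # shift distance d
angles = rng.uniform(-np.pi, np.pi, size=1000)          # shift angle phi

samples = np.vstack([distances, angles])  # shape (2, n)
p_kde = gaussian_kde(samples)             # the Gaussian kernel K_h of Eq. (1)

# Density of a 3-degree saccade directed horizontally to the right:
print(p_kde([[3.0], [0.0]]))
```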

The more recent model (Le Meur and Coutrot 2016) investigated saccadic biases across different scene categories. Since the former model simply used the GBVS model (Harel et al. 2007), which is relatively slow and inaccurate, to compute saliency maps, this model adopted a more accurate saliency model to achieve better results. Yet the dynamic influence of the fovea has still not been considered: only one static saliency map is used when estimating the scanpath.

Our model also adopts the described saccadic bias. The difference from Le Meur's model is that dynamic saliency maps, varying with the foveated locations, are used to obtain more accurate results. Furthermore, several objective criteria for comparing estimated saccades are introduced to provide a fairer evaluation.

Proposed method

The fovea, a tiny region at the center of the retina with the highest visual resolution, is responsible for detailed central vision in human observing activities. Visual resolution drops with retinal eccentricity, so areas far from the center have much lower resolution (Larson and Loschky 2009). In computational models, it is therefore more realistic to calculate saliency from the image on the retina rather than from the original image. When people observe a scene, the object corresponding to the current fixation falls on the fovea, enabling the brain to obtain detailed information; regions far from the fovea have low visual resolution, so the retinal image differs from the original scene. Many existing methods adopt only one static saliency map to predict scanpaths, ignoring these dynamic properties. Thus, we propose a method to estimate scanpaths using foveated image saliency. A foveated image simulates the real retinal image when a person gazes at one point, and this factor plays an important role in predicting the next fixation.

Moreover, biases in the distance and direction of gaze shifts have been investigated in recent studies (Tatler and Vincent 2008; Bays and Husain 2012; Le Meur and Liu 2015). The saccadic biases in the model of Le Meur and Liu offer a new perspective on adopting this element in estimating scanpaths: on two well-known eye fixation datasets, saccadic biases are analyzed and modeled using kernel density estimation. The bias of gaze shifts is used as an additional factor in our model.

Short-term memory has been shown to influence gaze shifts (Bledowski et al. 2009). It can store and integrate the information of past eye movements, so that past fixations are not revisited within a short time; its influence fades over time. This mechanism, known as IoR, has been used in several existing models and is also employed in ours.

The probability maps contributed by the factors mentioned above are combined to acquire the final probability map for gaze shifts. We generate several candidate points based on this probability map and select the location with the highest saliency gain as the next fixation. The details of each factor and the procedure of our model are described in the following subsections.

Foveated image saliency

In order to simulate this property of the retina, we introduce the foveated image (Geisler and Perry 2002) as the first factor. A multi-resolution pyramid method is used to generate foveated images. The original image is denoted as \(\varvec{I}_{\text{origin}}\). We convolve \(\varvec{I}_{\text{origin}}\) with multi-scale kernels \(\varvec{G}_{l}\), where \(l\) is the index of each layer. The resolution map in each layer is calculated as follows:

$$\varvec{P}_{l} = \begin{cases} \varvec{I}_{\text{origin}} , & l = 0 \\ \varvec{G}_{l} * \varvec{I}_{\text{origin}} , & l = 1,2, \ldots ,L \end{cases}$$
(2)

where * denotes the convolution operation, and \(L\) is the total number of layers in the multi-resolution pyramid. Let \(\varvec{x}^{F}\) be the current fixation location, and let \(e\) denote the distance between any point \(\varvec{x}\) and the fixation \(\varvec{x}^{F}\). \(\sigma_{l}\) corresponds to the radius of each kernel. Let \(\varvec{W}_{l}\) be the weight matrix of layer \(l\); each element of \(\varvec{W}_{l}\) is defined as follows:

$$w_{l} (\varvec{x}) = \begin{cases} \exp \left( - \frac{e^{2}}{2\sigma_{l}^{2}} \right), & l = 0 \\ \exp \left( - \frac{e^{2}}{2\sigma_{l}^{2}} \right) - \exp \left( - \frac{e^{2}}{2\sigma_{l - 1}^{2}} \right), & l = 1,2, \ldots ,L - 1 \\ 1 - \exp \left( - \frac{e^{2}}{2\sigma_{l - 1}^{2}} \right), & l = L \end{cases}$$
(3)

In the final foveated image \(\varvec{I}_{\text{foveated}}\), every pixel is computed as the weighted sum of the corresponding pixels in the pyramid layers (as shown in Fig. 1b). The pixel value at location \(\varvec{x}\) in the foveated image \(\varvec{I}_{\text{foveated}}\) can be described as:

$$i_{\text{foveated}} (\varvec{x}) = \sum\limits_{l = 0}^{L} {w_{l} (\varvec{x}) \times p_{l} (\varvec{x})} ,$$
(4)

where \(\times\) denotes multiplication. The calculated foveated image (shown in Fig. 1c) simulates the real retinal image for the fixation \(\varvec{x}^{F}\); the marked point in Fig. 1c represents the current fixation.

Fig. 1

Generation of foveated images, a original image, b diagram of calculating the foveated image, c foveated image
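A minimal Python sketch of Eqs. (2)–(4) follows, assuming a grayscale image and illustrative kernel radii; the exact kernels \(\varvec{G}_{l}\) and radii \(\sigma_{l}\) are not specified in the text, so the blur schedule here is an assumption for illustration.

```python
# A minimal sketch of foveation via a Gaussian pyramid (Eqs. (2)-(4)).
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(image, fixation, radii=(30.0, 60.0, 120.0, 240.0), base_blur=2.0):
    """image: 2-D float array; fixation: (row, col); radii: the sigma_l of Eq. (3)."""
    L = len(radii)
    # Eq. (2): layer 0 is the original image; layers 1..L are progressively
    # blurred copies (the doubling blur schedule stands in for the kernels G_l).
    layers = [image] + [gaussian_filter(image, base_blur * 2 ** l) for l in range(L)]

    # Squared distance e^2 of every pixel from the fixation x^F.
    rows, cols = np.indices(image.shape)
    e2 = (rows - fixation[0]) ** 2 + (cols - fixation[1]) ** 2

    # Eq. (3): Gaussian fall-off per layer; the weights telescope so they sum to 1.
    g = [np.exp(-e2 / (2.0 * r ** 2)) for r in radii]
    weights = [g[0]] + [g[l] - g[l - 1] for l in range(1, L)] + [1.0 - g[-1]]

    # Eq. (4): per-pixel weighted sum over the pyramid layers.
    return sum(w * p for w, p in zip(weights, layers))
```

With four radii, this corresponds to the 4-layer pyramid mentioned later in the implementation section (counting the blurred layers).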

In theory, any static saliency model can be used to acquire the saliency maps of the foveated images. In order to reduce computational complexity, the simple and fast PQFT method (Guo et al. 2008) is chosen in our model. The quaternion representation of the input image in Lab color space is denoted as \(\varvec{Q}\). A quaternion Fourier transform is then performed; the phase angles are kept while the moduli of all frequency components are set to unity. The image is then recovered by the inverse quaternion Fourier transform to obtain \(\varvec{Q}'\), and \(\varvec{P}_{\text{FS}}\) is acquired by:

$$\varvec{P}_{\text{FS}} = G\left( \left\| \varvec{Q}' \right\| \right),$$
(5)

where \(G\left( \cdot \right)\) denotes the Gaussian filter.

Built on the fast Fourier transform, the PQFT method is practical and effective at finding small salient objects. Following the steps above, the first factor \(\varvec{P}_{\text{FS}}\) can be acquired.
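The sketch below illustrates the phase-spectrum idea of Eq. (5) on a single luminance channel, i.e., the simpler PFT variant rather than the full quaternion transform over Lab color channels; it is an approximation for illustration, not the complete PQFT method.

```python
# A single-channel sketch of phase-spectrum saliency (simplified from PQFT).
import numpy as np
from scipy.ndimage import gaussian_filter

def phase_spectrum_saliency(gray, smooth_sigma=3.0):
    """gray: 2-D float array (luminance). Returns a saliency map like P_FS."""
    f = np.fft.fft2(gray)
    # Keep the phase angles; set the modulus of every frequency to unity.
    phase_only = np.exp(1j * np.angle(f))
    recovered = np.fft.ifft2(phase_only)       # plays the role of Q'
    # Eq. (5): Gaussian-smooth the modulus of the recovered image.
    return gaussian_filter(np.abs(recovered), smooth_sigma)
```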

Saccadic biases and inhibition of return

Existing studies (Tatler and Vincent 2008; Bays and Husain 2012; Le Meur and Liu 2015; Le Meur and Coutrot 2016) have found that gaze shifts are not random: there are distance and direction biases in fixation shifts. Through the analysis of eye tracking datasets, several characteristics of saccades can be obtained.

Le Meur and Liu (2015) used the kernel density estimation (KDE) toolbox (Botev et al. 2010) to compute the probability distribution of saccadic biases, represented as \(p_{\text{KDE}} (d,\phi )\). Based on their results, for a given current fixation \(\varvec{x}_{t}^{F}\), the element at location \(\varvec{x}\) of the probability distribution map \(\varvec{P}_{\text{SB}}\) is:

$$p_{\text{SB}} (\varvec{x}) = p_{\text{KDE}} (\text{norm}(\varvec{x} - \varvec{x}_{t}^{F} ),{\text{angle}}(\varvec{x} - \varvec{x}_{t}^{F} )),$$
(6)

where \(\text{norm}( \cdot )\) and \({\text{angle}}( \cdot )\) denote the norm and angle operations, respectively. This calculation gives the probability distribution of saccadic biases \(\varvec{P}_{\text{SB}}\), which is used as an additional factor in our model.
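A possible realization of Eq. (6) in Python is sketched below; it assumes a fitted gaussian_kde density, as in the earlier sketch, and treats pixel offsets as a stand-in for visual angle.

```python
# A minimal sketch of Eq. (6): a per-pixel saccadic-bias map around x_t^F.
import numpy as np
from scipy.stats import gaussian_kde

def bias_map(shape, fixation, p_kde):
    """shape: (H, W); fixation: (row, col) of x_t^F; p_kde: fitted 2-D density."""
    rows, cols = np.indices(shape)
    dy, dx = rows - fixation[0], cols - fixation[1]
    d = np.hypot(dx, dy)            # norm(x - x_t^F)
    phi = np.arctan2(dy, dx)        # angle(x - x_t^F)
    p_sb = p_kde(np.vstack([d.ravel(), phi.ravel()])).reshape(shape)
    return p_sb / p_sb.sum()        # normalize to a probability map

# Toy usage with a synthetic density fitted as in the earlier sketch:
rng = np.random.default_rng(1)
toy = np.vstack([rng.gamma(2.0, 2.0, 500), rng.uniform(-np.pi, np.pi, 500)])
P_SB = bias_map((120, 160), (60, 80), gaussian_kde(toy))
```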

The IoR mechanism is taken into account when choosing the next fixation point in our model: past fixations have little chance of being revisited within a short time. A linear forgetting transition probability is used to simulate the IoR mechanism, taking into account the past \(T\) fixations; as new fixations are generated, the inhibited regions of the probability map are linearly restored, so the inhibition effect of previous fixations gradually fades. Let \(\varvec{x}_{t - \tau }^{F}\) denote the fixation location \(\tau\) steps in the past, \(\varvec{x}\) the coordinate of an arbitrary location, and \(\sigma_{\text{M}}\) the standard deviation of a two-dimensional Gaussian distribution. Each element of the forgetting area \(\varvec{M}_{\tau }\) around the past fixation \(\varvec{x}_{t - \tau }^{F}\), which is subtracted from the probability map, is:

$$m_{\tau } (\varvec{x}) = \exp \left( { - \frac{{\left\| {\varvec{x} - \varvec{x}_{t - \tau }^{F} } \right\|^{2} }}{{2\sigma_{\text{M}}^{2} }}} \right) .$$
(7)

Let \(\{ \varvec{x}_{t - \tau }^{F} \}_{\tau = 0}^{T - 1}\) denote the locations of the previous fixations, and let \(\varvec{1}\) be the all-ones matrix. The probability distribution of the next fixation \(\varvec{P}_{\text{IOR}}\) can be formulated as follows:

$$\varvec{P}_{\text{IOR}} = N\left| {\varvec{1} - \sum\limits_{\tau = 0}^{T - 1} {\frac{T - \tau }{T}} \varvec{M}_{\tau } } \right|,$$
(8)

where \(N\left| \cdot \right|\) denotes clipping and normalization. As time passes, the influence of past fixations gradually fades: when the IoR map for the next fixation is computed, the current forgetting area \(\varvec{M}_{\tau }\) is removed from the map, and the forgetting areas removed earlier are linearly restored.
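The following sketch illustrates Eqs. (7) and (8); the inhibition radius \(\sigma_{\text{M}}\) here is an assumed value, not the paper's exact setting.

```python
# A minimal sketch of linear-forgetting IoR over the last T fixations.
import numpy as np

def ior_map(shape, past_fixations, sigma_m=25.0):
    """past_fixations: list of (row, col), most recent first; T = len(list)."""
    T = len(past_fixations)
    rows, cols = np.indices(shape)
    p = np.ones(shape)                      # the all-ones matrix of Eq. (8)
    for tau, (r, c) in enumerate(past_fixations):
        # Eq. (7): Gaussian forgetting area around the past fixation.
        m_tau = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma_m ** 2))
        # Eq. (8): more recent fixations are inhibited more strongly.
        p -= (T - tau) / T * m_tau
    p = np.clip(p, 0.0, None)               # the clipping in N|.|
    return p / p.sum()                      # ...and the normalization

P_IOR = ior_map((120, 160), [(60, 80), (30, 100), (50, 40)])
```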

Strategy for choosing the next fixation candidates

According to the above factors, three probability maps have been computed. Assuming that these maps are mutually independent, the transition probability can be computed as follows:

$$p(\varvec{x}) = p_{\text{FS}} (\varvec{x}) \times p_{\text{SB}} (\varvec{x}) \times p_{\text{IOR}} (\varvec{x}) .$$
(9)

There are two strategies for selecting the next fixation according to the probability map \(\varvec{P}\). One is choosing the point with the highest value in \(\varvec{P}\). The other is generating several candidate points by random sampling and taking the candidate with the highest saliency gain with respect to the previous fixation (Le Meur and Liu 2015). Owing to the randomness of the human visual system, the transition probability only represents a trend over repeated trials and cannot determine the next fixation in a single trial; with the former strategy, the calculated scanpaths tend to cycle among a few local maxima. Therefore, the latter strategy is chosen: based on the final probability map, 10 candidate points are randomly sampled, and the candidate with the highest value of \(p_{\text{FS}} (\varvec{x})\) is chosen as the next fixation.
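A minimal sketch of Eq. (9) and this candidate strategy might look as follows; the three input maps are assumed to be nonnegative arrays of equal shape, such as those produced by the earlier sketches.

```python
# A minimal sketch of Eq. (9) plus candidate sampling for the next fixation.
import numpy as np

def next_fixation(p_fs, p_sb, p_ior, n_candidates=10, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    p = p_fs * p_sb * p_ior                 # Eq. (9): maps assumed independent
    p = p.ravel() / p.sum()
    # Sampling candidates keeps the randomness of human gaze; always taking
    # the global maximum tends to produce periodic scanpaths.
    idx = rng.choice(p.size, size=n_candidates, replace=False, p=p)
    best = idx[np.argmax(p_fs.ravel()[idx])]  # highest p_FS among candidates
    return np.unravel_index(best, p_fs.shape)  # (row, col) of the next fixation
```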

Implementation

The implementation details and the parameter selection used in the experiments are described in this subsection. The steps of the proposed algorithm are listed in Table 1.

Table 1 Proposed algorithm

Figure 2 shows the framework of the proposed model together with the probability maps of each step. In the final probability map in Fig. 2, the current fixation (triangle), the candidates (squares), and the next fixation (circle) are marked. It can be seen that combining the foveated saliency with the other two factors is effective in choosing the next fixation.

Fig. 2

Framework of the proposed model based on foveated image saliency

A 4-layer pyramid is used to compute the foveated image; more layers are unnecessary for obtaining the saliency map and would increase the amount of computation. The KDE toolbox of Botev et al. (2010) is used in computing saccadic biases. Five previous fixations are taken into account in the IoR mechanism, because one fixation lasts about 300 ms and the IoR effect lasts about 1.5–3 s (Samuel and Kat 2003); since the real eye tracking sequences in the datasets contain about 3–12 points, using more than five previous fixations is unnecessary in our experiments. The IoR area parameter \(\sigma_{\text{M}}\) follows that of the STB model (Walther and Koch 2006). The number of candidates controls the randomness of the estimated results: as the candidate number increases, the results of the second strategy approach those of the first strategy, while as it approaches one, the randomness increases and the candidates may fail to include the salient points. Thus, we fix the candidate number at 10, which gives the best performance across the experiments. For every input image, one generated scanpath consists of 8 fixations, each represented by its location \((x,y)\).

Results

Similarity metrics

In our experiments, both the locations and the order of fixations are taken into account when choosing similarity metrics.

If we regard scanpaths as sets of points without considering the order, sAUC (shuffled area under the ROC curve) (Borji et al. 2013) can be used to measure the difference between estimated results and ground truth data. sAUC is an improvement over the uniform AUC (area under the ROC curve); the higher the sAUC score, the better the result.
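The sketch below illustrates the shuffled-AUC idea: saliency values at the true fixations serve as positives, while values at fixations borrowed from other images serve as negatives, which penalizes center bias. This is a common reading of the metric, not the exact implementation of Borji et al. (2013).

```python
# A minimal sketch of shuffled AUC via the rank-based (Mann-Whitney) AUC.
import numpy as np

def sauc(saliency, fixations, shuffled_fixations):
    """fixations / shuffled_fixations: (n, 2) integer (row, col) arrays."""
    pos = saliency[fixations[:, 0], fixations[:, 1]]
    neg = saliency[shuffled_fixations[:, 0], shuffled_fixations[:, 1]]
    # Probability that a positive outranks a negative, with ties counted half.
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq
```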

When the order of fixations is taken into account, one sequence is estimated for each image, while the ground truth eye tracking data consist of the scanpaths of several users viewing that image. Therefore, we evaluate the estimated sequence against the scanpath of each viewer and average all the evaluation scores to yield an overall score for the image.

Hausdorff distance and mean minimal distance (Wang et al. 2011) are metrics for measuring the similarity of two sets; they can also be used to compare sequences. The sequence \(X = (\varvec{x}_{1} ,\varvec{x}_{2} , \ldots ,\varvec{x}_{n} )\) (or \(Y = (\varvec{y}_{1} ,\varvec{y}_{2} , \ldots ,\varvec{y}_{m} )\)) is divided into pieces of length \(k\). \(C_{x}^{k} (t) = (\varvec{x}_{t} ,\varvec{x}_{t + 1} , \ldots ,\varvec{x}_{t + k - 1} )\) is defined as the \(k\)-dimensional vector starting from the \(t\) th fixation of sequence \(X\). The model space \(\{ C_{x}^{k} (t)\}_{t} \subseteq R^{k}\) is obtained by varying \(t\); the model space of sequence \(Y\) is obtained in the same way. In Eqs. (10) and (11), \(d_{\text{H}}^{k}\) denotes the Hausdorff distance and \(d_{\text{MM}}^{k}\) the mean minimal distance:

$$d_{\text{H}}^{k} = \mathop {\max }\limits_{t} \left\{ {\mathop {\min }\limits_{\tau } \left\{ {\left\| {C_{x}^{k} (t) - C_{y}^{k} (\tau )} \right\|} \right\}} \right\}/k,$$
(10)
$$d_{\text{MM}}^{k} = E_{t} \left\{ {\mathop {\min }\limits_{\tau } \left\{ {\left\| {C_{x}^{k} (t) - C_{y}^{k} (\tau )} \right\|} \right\}} \right\}.$$
(11)

The former computes the maximal value of all the minimal distances between the two sets, while the latter computes the mean value. A smaller distance indicates a better prediction.
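A compact Python sketch of Eqs. (10) and (11) is given below, treating each length-\(k\) window of fixations as a flattened vector; note that, following the equations as printed, only the Hausdorff distance is scaled by \(k\).

```python
# A minimal sketch of Eqs. (10)-(11) over length-k subsequences of scanpaths.
import numpy as np

def subsequences(path, k):
    """path: (n, 2) array of fixations -> (n-k+1, 2k) array of windows C^k."""
    return np.stack([path[t:t + k].ravel() for t in range(len(path) - k + 1)])

def scanpath_distances(X, Y, k=2):
    cx = subsequences(np.asarray(X, float), k)
    cy = subsequences(np.asarray(Y, float), k)
    # Pairwise Euclidean distances between all windows of X and all of Y.
    d = np.linalg.norm(cx[:, None, :] - cy[None, :, :], axis=2)
    d_min = d.min(axis=1)        # min over tau, for each t
    d_h = d_min.max() / k        # Eq. (10): Hausdorff distance, scaled by k
    d_mm = d_min.mean()          # Eq. (11): mean minimal distance
    return d_h, d_mm

# Example with two short scanpaths:
dh, dmm = scanpath_distances([(10, 20), (40, 50), (80, 30)],
                             [(12, 22), (42, 48), (70, 90)], k=2)
```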

The ScanMatch metric (Cristino et al. 2010) is recommended by Anderson et al. (2015), who gave an overview and comparison of existing metrics for measuring sequence similarity. In our experiments, we focus on the positions of fixations, the amplitude and direction of saccades, and the order of fixations; therefore, following the recommendation of Anderson et al. (2015), ScanMatch is chosen as the similarity metric for scanpaths. ScanMatch compares fixation sequences based on the Needleman–Wunsch algorithm (Needleman and Wunsch 1970), which uses dynamic programming to align and score sequences. The ScanMatch score is normalized to [0, 1] and is independent of sequence length; a perfect match between two sequences receives a score of 1. This metric has the advantage of being robust and objective.
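For illustration, the following sketch shows the core Needleman–Wunsch recursion. Real ScanMatch first quantizes fixations into grid-cell letters and uses a distance-based substitution matrix; the simple match/mismatch scores and gap penalty here are assumptions for illustration only.

```python
# A minimal sketch of Needleman-Wunsch global sequence alignment.
import numpy as np

def needleman_wunsch(a, b, match=1.0, mismatch=-1.0, gap=-0.5):
    """a, b: sequences of hashable symbols (e.g. grid-cell labels)."""
    n, m = len(a), len(b)
    score = np.zeros((n + 1, m + 1))
    score[:, 0] = gap * np.arange(n + 1)   # leading gaps in b
    score[0, :] = gap * np.arange(m + 1)   # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i, j] = max(score[i - 1, j - 1] + sub,  # align two symbols
                              score[i - 1, j] + gap,      # gap in b
                              score[i, j - 1] + gap)      # gap in a
    return score[n, m]

print(needleman_wunsch("ABCD", "ABD"))  # -> 2.5
```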

Each of the metrics introduced above focuses on a different aspect: sAUC measures how well the estimated fixations match the ground truth fixation density; Hausdorff distance and mean minimal distance compare two scanpaths from the perspective of subsequences; and ScanMatch gives an overall comparison of the two scanpaths. Therefore, the experimental results can be objectively evaluated and analyzed with these three kinds of metrics.

Datasets

We use the public eye tracking datasets of Bruce and Judd (Bruce and Tsotsos 2005; Judd et al. 2009) to evaluate the performance of our model. The Bruce dataset (Bruce and Tsotsos 2005) contains 120 natural images with eye tracking data as ground truth; the scanpaths of 20 users are recorded for each image, with sequence lengths ranging from 3 to 8. The Judd dataset (Judd et al. 2009) contains 1003 images of various types, including natural images, portraits, and psychological patterns; the scanpaths of 15 users are recorded for each image, with sequence lengths ranging from 6 to 12.

Analysis and discussions

Experiments are conducted on two datasets: the eye tracking datasets of Bruce and Tsotsos (2005) and Judd et al. (2009). Our model is compared with four state-of-the-art approaches: STB (Walther and Koch 2006), SHSSNI (Wang et al. 2011), DMSG (Engbert et al. 2015), and SMFC (Le Meur and Liu 2015), all of which were introduced in the Introduction. For every model, 10 scanpaths per image are generated for the evaluation, each consisting of 8 fixations.

Since DMSG calculates scanpaths based on existing saliency maps, we use STB saliency maps here. In order to validate the effect of the proposed framework, three variants of our method are studied. In Case A (CA), the PQFT saliency map in our framework is replaced by STB; this is designed to examine whether the proposed framework is more effective than the reference methods STB and DMSG when combined with the same saliency computation. In Case B (CB), the saliency map is computed without using foveated images. In Case C (CC), the saliency map is computed without the saccadic biases. CB and CC investigate the effects of the foveated image and the saccadic bias on the results, respectively.

Tables 2 and 3 show the scores of the four similarity metrics, averaged over the ten scanpaths. For the Hausdorff distance and the mean minimal distance, subsequences of different lengths \(k\) are used. As listed in Tables 2 and 3, in terms of every similarity metric, the proposed model outperforms the other approaches, especially on the Bruce dataset. Given that the Judd dataset contains not only natural images but also portraits and psychological patterns, its eye tracking data may be influenced by high-level factors, which may explain why the proposed model does not perform as well on one or two metrics.

Table 2 Scores of 4 similarity metrics for 8 methods in the dataset of Bruce
Table 3 Scores of 4 similarity metrics for 8 methods in the dataset of Judd

The SHSSNI model obtains relatively good results among the competitors, but it relies on ICA filter responses and runs quite slowly. The DMSG model is relatively simple but depends heavily on the saliency map used, which may lead to poor performance. SMFC is remarkable work that exploits the saccadic biases of the datasets, but it does not sufficiently consider dynamic saliency maps; we made improvements to address this shortcoming. Although the frameworks of SMFC and our model seem similar, our work focuses on biologically plausible foveated images, which is the most important difference, and the results show that our model works better. Our model takes the saliency change caused by gaze shifts into consideration, which is essential for obtaining good results; the effect of the biologically plausible foveated image is also demonstrated in the CB experiment. Furthermore, the saliency map computed by GBVS is not as accurate as that computed by PQFT, which was proposed by our research group (Guo et al. 2008).

It is known that the saliency map computed by the PQFT method works better than that of the STB model. The comparisons of CA against STB and DMSG demonstrate the advantage of our framework: even when using the same STB saliency computation, our method still outperforms the others.

The results of CB and CC show the effects of the two important components, the foveated image and the saccadic bias. The sAUC scores of CB and the proposed model are quite close; however, judging from the Hausdorff distance, mean minimal distance, and ScanMatch scores, the proposed model with foveated images works better than the variant without them. The results of CC are quite close to those of the proposed model: compared with the foveated image, the saccadic bias contributes only a small improvement to the overall model. It can be seen that the saliency computed on foveated images plays an important role in the whole model.

Examples of the estimated results are shown in Fig. 3a–d, and the corresponding real eye tracking data are shown in Fig. 3e–h. The results in Fig. 3a–c are close to the ground truth scanpaths, and most of the estimated results perform this well. However, there are exceptional cases such as Fig. 3d, where several salient regions are scattered and the distances between them are relatively large. Owing to the tendency toward short-distance gaze shifts, candidates far from the current fixation are rarely generated, so the estimated scanpaths may fail to cross the scattered salient regions. Solving this problem will be our future work.

Fig. 3

Comparison between the results of the proposed method and the ground truth, a–d estimated scanpath results, e–h real eye tracking data corresponding to (a)–(d)

In summary, our model yields higher accuracy than the existing methods compared on the two datasets under objective measures. We have also evaluated the contributions of the whole framework as well as of each of its parts. The results show that using dynamic saliency maps is critical for scanpath estimation.

Conclusions

In this paper, a bio-inspired fixation and scanpath estimation method has been proposed to address the inaccuracy of existing models and their neglect of the saliency changes caused by gaze shifts. The proposed model takes into consideration that saliency maps change with fixation locations: probability maps based on foveated image saliency are used, with the saccadic biases of gaze shifts and the IoR mechanism as two additional factors. The most appropriate candidate, chosen from a series of candidates generated through random sampling of the integrated probability map, becomes the next fixation. Experimental results have shown that the estimated scanpaths accord with the distribution of real data in most cases; under the evaluation of several comprehensive measures, the proposed method shows advantages over existing models on the two well-known datasets.