Introduction

A vast amount of visual data enters the human eyes every second, far more than the brain can process in real time. The human visual system therefore automatically extracts important information from the flood of visual input for further processing in the cerebral cortex, a process known as selective attention. Attention shift, a selective attention behavior, has become a research hotspot. A scanpath consists of fixations and gaze shifts (saccades). Fixations are gaze locations, which indicate the attractive regions in the scene; scanpaths are the sequences of these locations, which reflect the order in which the attractive regions are visited. Research on shifts of visual attention focuses on the regions of a scene that draw people's attention and uncovers how people select these regions.

Within the past two decades, various computational models of visual attention have been proposed. Among them, space-based models (Borji and Itti 2013) focus on the density and the sequence of fixations. This research has helped to explain the biopsychological mechanisms of human visual attention and has contributed to modern applications such as Web site advertisement design and robot vision.

Based on their object of research, space-based models can be divided into two kinds: fixation density prediction models and saccadic scanpath estimation models.

Fixation density prediction models calculate image saliency statically. Based on the feature-integration theory of attention (Treisman and Gelade 1980), Itti et al. (1998) first proposed a biologically plausible computational saliency model, which simulates the working process of the early visual areas. Motivated by Itti's model, a graph-based visual saliency (GBVS) model was proposed (Harel et al. 2007), in which saliency is defined as the stationary distribution of a fully connected graph. However, both of these classical models have very high computational complexity. Therefore, a faster model using the spectral residual was put forward (Hou and Zhang 2007), suggesting that the information of image saliency is mainly contained in the residual amplitude spectrum. Many other researchers, however, emphasize the importance of the phase spectrum. Guo et al. (2008) proposed a model using the phase spectrum of the quaternion Fourier transform (PQFT), which is even faster and more accurate than Hou and Zhang's method in local saliency detection.

Saccadic scanpath estimation models can predict both the density and the order of fixations. Yet research on gaze shifts has not been systematically and fully developed so far. The extension of Itti's static model (Saliency Toolbox, STB, Walther and Koch 2006) uses the winner-takes-all and inhibition of return (IoR) mechanisms to simulate scanpaths: the next fixation point is chosen as the maximal value in the saliency map with the current fixation region inhibited. The whole scanpath is computed from one static saliency map, disregarding the changes of saliency caused by gaze shifts. Another model based on Itti's (Da Silva and Courboulay 2013) uses a predator–prey method to simulate the dynamic competition between different stimuli, with a feedback criterion added for generating fixations in scene exploration; however, this model depends on too many parameters. A method based on the principle of information maximization was proposed (Wang et al. 2011), which applies independent component analysis (ICA) filters and selects the next fixation by the maximal residual information between reference and filter response. However, its ICA filter responses require large numbers of natural image patches for model training, increasing the computational burden. In a recent paper (Engbert et al. 2015), a simple saccadic saliency strategy was presented that reduces computational time by using grids to simplify the image; real eye tracking data are used to verify the statistical structure of eye movements. Strictly speaking, it is a saccadic strategy built on existing saliency maps. A saccadic model for the free-viewing condition and its extension (Le Meur and Liu 2015; Le Meur and Coutrot 2016) were also proposed; both emphasize the oculomotor biases of real eye tracking data. Other more complex models have been reported, including one that uses ICA to find the location of the maximal super-Gaussian component, measured by the kurtosis function, as the next fixation (Sun et al. 2014), and another that uses a trained hidden Markov model to obtain the next fixation (Liu et al. 2013).

Existing methods are limited to using only one static saliency map and yield unsatisfactory accuracy. To tackle these problems, we propose an effective bio-inspired scanpath estimation method that gives sufficient consideration to the dynamic influence of the fovea on saliency. The bias of gaze shifts (Le Meur and Liu 2015) and the mechanism of IoR in short-term memory are also considered in the proposed model. By integrating these factors, a probability map of gaze shifts is acquired to generate candidates for the next fixation. Besides some traditional similarity criteria, several novel and objective criteria for comparing fixation sequences are introduced in the experiments. Experimental results show that under these measures, our method performs more accurately on several datasets than many existing models do.

Related works

Researchers have investigated the biases of saccades for a long time and have found many characteristics of saccades (Tatler and Vincent 2008; Bays and Husain 2012). Among these studies, Le Meur and Liu (2015) and Le Meur and Coutrot (2016) conducted relatively complete experiments. The former model (Le Meur and Liu 2015) contributes a method for computing the joint distribution of the distance and direction of gaze shifts, using the kernel density estimation (KDE) toolbox (Botev et al. 2010). Let the current fixation be \(\varvec{x}_{t}\) and the next one \(\varvec{x}_{t + 1}\). The norm of the saccadic vector \(\varvec{x}_{t + 1} - \varvec{x}_{t}\) is defined as the shift distance (visual angle) \(d\), and its angle from the abscissa is defined as the shift angle \(\phi\). The estimated distribution is represented as

$$p_{\text{KDE}} (d,\phi ) = \frac{1}{n}\sum\limits_{i = 1}^{n} {K_{h} (d - d_{i} ,\phi - \phi_{i} )} ,$$
(1)

where \(n\) is the total number of samples \((d_{i} ,\phi_{i} )\), and \(K_{h}\) is a Gaussian kernel. Besides the distribution of gaze shifts, this model also adopts saliency maps computed by the GBVS model (Harel et al. 2007) and the IoR mechanism, both of which have been used in other models.
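For illustration, the following Python sketch shows how such a two-dimensional density over (distance, angle) pairs could be fitted. The original work uses the MATLAB KDE toolbox of Botev et al. (2010); here scipy.stats.gaussian_kde is substituted, so the bandwidth selection differs, and the training samples are synthetic placeholders rather than real eye tracking data.

```python
# A minimal sketch of Eq. (1): fit a 2-D Gaussian-kernel density over
# observed (distance, angle) saccade samples.
import numpy as np
from scipy.stats import gaussian_kde

# Placeholder training data standing in for saccades extracted from a dataset.
rng = np.random.default_rng(0)
distances = rng.gamma(shape=2.0, scale=2.0, size=1000)  # shift distance d
angles = rng.uniform(-np.pi, np.pi, size=1000)          # shift angle phi

samples = np.vstack([distances, angles])  # shape (2, n)
p_kde = gaussian_kde(samples)             # the Gaussian kernel K_h of Eq. (1)

# Density of a 3-degree saccade directed horizontally to the right:
print(p_kde([[3.0], [0.0]]))
```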

The more recent model (Le Meur and Coutrot 2016) investigated saccadic biases across different scene categories. Since the former model simply used the GBVS model (Harel et al. 2007), which is relatively slow and inaccurate, to compute saliency maps, this model adopted a more accurate saliency model to achieve better results. Yet the dynamic influence of the fovea has still not been considered: only one static saliency map is used when estimating the scanpath.

Our model also adopts the described saccadic bias. The difference from Le Meur's model is that dynamic saliency maps, varying with the foveated locations, are used to obtain more accurate results. Furthermore, several objective criteria for comparing estimated saccades are introduced to provide a fairer evaluation.

Proposed method

The fovea, a tiny region at the center of the retina with the highest visual resolution, is responsible for detailed central vision in human observing activities. Visual resolution drops with retinal eccentricity, so areas far from the center have much lower resolution (Larson and Loschky 2009). In computational models, it is therefore more realistic to calculate saliency from the image on the retina rather than from the original image. When people observe a scene, the object corresponding to the current fixation falls on the fovea, enabling the brain to obtain detailed information; regions far from the fovea have low visual resolution, so the retinal image differs from the original scene. Many existing methods adopt only one static saliency map to predict scanpaths, ignoring these dynamic properties. Thus, we propose a method to estimate scanpaths using foveated image saliency. A foveated image simulates the real retinal image when a person gazes at one point, and this factor plays an important role in predicting the next fixation.

Moreover, biases in the distance and direction of gaze shifts have been investigated in recent studies (Tatler and Vincent 2008; Bays and Husain 2012; Le Meur and Liu 2015). The saccadic biases in the model of Le Meur and Liu offer a new perspective on adopting this element in estimating scanpaths: on two well-known eye fixation datasets, saccadic biases are analyzed and modeled using kernel density estimation. The bias of gaze shifts is used as an additional factor in our model.

Short-term memory has been shown to influence gaze shifts (Bledowski et al. 2009). It can store and integrate the information of past eye movements, so that past fixations are not revisited within a short time; its influence fades over time. This mechanism, known as IoR, has been used in several existing models and is also employed in ours.

The probability maps contributed by the factors mentioned above are combined to acquire the final probability map for gaze shifts. We generate several candidate points based on this probability map and select the location with the highest saliency gain as the next fixation. The details of each factor and the procedure of our model are described in the following subsections.

Foveated image saliency

In order to simulate this property of the retina, we introduce the foveated image (Geisler and Perry 2002) as the first factor. A multi-resolution pyramid method is used to generate foveated images. The original image is denoted as \(\varvec{I}_{\text{origin}}\). We convolve \(\varvec{I}_{\text{origin}}\) with multi-scale kernels \(\varvec{G}_{l}\), where \(l\) is the index of each layer. The resolution map in each layer is calculated as follows:

$$\varvec{P}_{l} = \begin{cases} \varvec{I}_{\text{origin}} , & l = 0 \\ \varvec{G}_{l} * \varvec{I}_{\text{origin}} , & l = 1,2, \ldots ,L \end{cases}$$
(2)

where * denotes the convolution operation, and \(L\) is the total number of layers in the multi-resolution pyramid. Let \(\varvec{x}^{F}\) be the current fixation location, and let \(e\) denote the distance between any point \(\varvec{x}\) and the fixation \(\varvec{x}^{F}\). \(\sigma_{l}\) corresponds to the radius of each kernel. Let \(\varvec{W}_{l}\) be the weight matrix of layer \(l\); each element of \(\varvec{W}_{l}\) is defined as follows:

$$w_{l} (\varvec{x}) = \begin{cases} \exp \left( - \frac{e^{2}}{2\sigma_{l}^{2}} \right), & l = 0 \\ \exp \left( - \frac{e^{2}}{2\sigma_{l}^{2}} \right) - \exp \left( - \frac{e^{2}}{2\sigma_{l - 1}^{2}} \right), & l = 1,2, \ldots ,L - 1 \\ 1 - \exp \left( - \frac{e^{2}}{2\sigma_{l - 1}^{2}} \right), & l = L \end{cases}$$
(3)

In the final foveated image \(\varvec{I}_{\text{foveated}}\), every pixel is computed as the weighted sum of the corresponding pixels in the pyramid layers (as shown in Fig. 1b). The pixel value at location \(\varvec{x}\) in the foveated image \(\varvec{I}_{\text{foveated}}\) can be described as:

$$i_{\text{foveated}} (\varvec{x}) = \sum\limits_{l = 0}^{L} {w_{l} (\varvec{x}) \times p_{l} (\varvec{x})} ,$$
(4)

where \(\times\) denotes multiplication. The calculated foveated image (shown in Fig. 1c) simulates the real retinal image for the fixation \(\varvec{x}^{F}\); the marked point in Fig. 1c represents the current fixation.

Fig. 1

Generation of foveated images, a original image, b diagram of calculating the foveated image, c foveated image
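A minimal Python sketch of Eqs. (2)–(4) follows, assuming a grayscale image and illustrative kernel radii; the exact kernels \(\varvec{G}_{l}\) and radii \(\sigma_{l}\) are not specified in the text, so the blur schedule here is an assumption for illustration.

```python
# A minimal sketch of foveation via a Gaussian pyramid (Eqs. (2)-(4)).
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(image, fixation, radii=(30.0, 60.0, 120.0, 240.0), base_blur=2.0):
    """image: 2-D float array; fixation: (row, col); radii: the sigma_l of Eq. (3)."""
    L = len(radii)
    # Eq. (2): layer 0 is the original image; layers 1..L are progressively
    # blurred copies (the doubling blur schedule stands in for the kernels G_l).
    layers = [image] + [gaussian_filter(image, base_blur * 2 ** l) for l in range(L)]

    # Squared distance e^2 of every pixel from the fixation x^F.
    rows, cols = np.indices(image.shape)
    e2 = (rows - fixation[0]) ** 2 + (cols - fixation[1]) ** 2

    # Eq. (3): Gaussian fall-off per layer; the weights telescope so they sum to 1.
    g = [np.exp(-e2 / (2.0 * r ** 2)) for r in radii]
    weights = [g[0]] + [g[l] - g[l - 1] for l in range(1, L)] + [1.0 - g[-1]]

    # Eq. (4): per-pixel weighted sum over the pyramid layers.
    return sum(w * p for w, p in zip(weights, layers))
```

With four radii, this corresponds to the 4-layer pyramid mentioned later in the implementation section (counting the blurred layers).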

In theory, any static saliency model can be used to acquire the saliency maps of the foveated images. In order to reduce computational complexity, the simple and fast PQFT method (Guo et al. 2008) is chosen in our model. The quaternion representation of the input image in Lab color space is denoted as \(\varvec{Q}\). A quaternion Fourier transform is then performed; the phase angles are kept while the moduli of all frequency components are set to unity. The image is then recovered by the inverse quaternion Fourier transform to obtain \(\varvec{Q}'\), and \(\varvec{P}_{\text{FS}}\) is acquired by:

$$\varvec{P}_{\text{FS}} = G\left( \left\| \varvec{Q}' \right\| \right),$$
(5)

where \(G\left( \cdot \right)\) denotes the Gaussian filter.

Built on the fast Fourier transform, the PQFT method is practical and effective at finding small salient objects. Following the steps above, the first factor \(\varvec{P}_{\text{FS}}\) can be acquired.
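The sketch below illustrates the phase-spectrum idea of Eq. (5) on a single luminance channel, i.e., the simpler PFT variant rather than the full quaternion transform over Lab color channels; it is an approximation for illustration, not the complete PQFT method.

```python
# A single-channel sketch of phase-spectrum saliency (simplified from PQFT).
import numpy as np
from scipy.ndimage import gaussian_filter

def phase_spectrum_saliency(gray, smooth_sigma=3.0):
    """gray: 2-D float array (luminance). Returns a saliency map like P_FS."""
    f = np.fft.fft2(gray)
    # Keep the phase angles; set the modulus of every frequency to unity.
    phase_only = np.exp(1j * np.angle(f))
    recovered = np.fft.ifft2(phase_only)       # plays the role of Q'
    # Eq. (5): Gaussian-smooth the modulus of the recovered image.
    return gaussian_filter(np.abs(recovered), smooth_sigma)
```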

Saccadic biases and inhibition of return

Existing studies (Tatler and Vincent 2008; Bays and Husain 2012; Le Meur and Liu 2015; Le Meur and Coutrot 2016) have found that gaze shifts are not random: there are distance and direction biases in fixation shifts. Through the analysis of eye tracking datasets, several characteristics of saccades can be obtained.

Le Meur and Liu (2015) used the kernel density estimation (KDE) toolbox (Botev et al. 2010) to compute the probability distribution of saccadic biases, represented as \(p_{\text{KDE}} (d,\phi )\). Based on their results, for a given current fixation \(\varvec{x}_{t}^{F}\), the element at location \(\varvec{x}\) of the probability distribution map \(\varvec{P}_{\text{SB}}\) is:

$$p_{\text{SB}} (\varvec{x}) = p_{\text{KDE}} (\text{norm}(\varvec{x} - \varvec{x}_{t}^{F} ),{\text{angle}}(\varvec{x} - \varvec{x}_{t}^{F} )),$$
(6)

where \(\text{norm}( \cdot )\) and \({\text{angle}}( \cdot )\) denote the norm and angle operations, respectively. This calculation gives the probability distribution of saccadic biases \(\varvec{P}_{\text{SB}}\), which is used as an additional factor in our model.
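A possible realization of Eq. (6) in Python is sketched below; it assumes a fitted gaussian_kde density, as in the earlier sketch, and treats pixel offsets as a stand-in for visual angle.

```python
# A minimal sketch of Eq. (6): a per-pixel saccadic-bias map around x_t^F.
import numpy as np
from scipy.stats import gaussian_kde

def bias_map(shape, fixation, p_kde):
    """shape: (H, W); fixation: (row, col) of x_t^F; p_kde: fitted 2-D density."""
    rows, cols = np.indices(shape)
    dy, dx = rows - fixation[0], cols - fixation[1]
    d = np.hypot(dx, dy)            # norm(x - x_t^F)
    phi = np.arctan2(dy, dx)        # angle(x - x_t^F)
    p_sb = p_kde(np.vstack([d.ravel(), phi.ravel()])).reshape(shape)
    return p_sb / p_sb.sum()        # normalize to a probability map

# Toy usage with a synthetic density fitted as in the earlier sketch:
rng = np.random.default_rng(1)
toy = np.vstack([rng.gamma(2.0, 2.0, 500), rng.uniform(-np.pi, np.pi, 500)])
P_SB = bias_map((120, 160), (60, 80), gaussian_kde(toy))
```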

The IoR mechanism is taken into account when choosing the next fixation point in our model: past fixations have little chance of being revisited within a short time. A linear forgetting transition probability is used to simulate the IoR mechanism, taking into account the past \(T\) fixations; as new fixations are generated, the inhibited regions of the probability map are linearly restored, so the inhibition effect of previous fixations gradually fades. Let \(\varvec{x}_{t - \tau }^{F}\) denote the fixation location \(\tau\) steps in the past, \(\varvec{x}\) the coordinate of an arbitrary location, and \(\sigma_{\text{M}}\) the standard deviation of a two-dimensional Gaussian distribution. Each element of the forgetting area \(\varvec{M}_{\tau }\) around the past fixation \(\varvec{x}_{t - \tau }^{F}\), which is subtracted from the probability map, is:

$$m_{\tau } (\varvec{x}) = \exp \left( { - \frac{{\left\| {\varvec{x} - \varvec{x}_{t - \tau }^{F} } \right\|^{2} }}{{2\sigma_{\text{M}}^{2} }}} \right) .$$
(7)

Let \(\{ \varvec{x}_{t - \tau }^{F} \}_{\tau = 0}^{T - 1}\) denote the locations of the previous fixations, and let \(\varvec{1}\) be the all-ones matrix. The probability distribution of the next fixation \(\varvec{P}_{\text{IOR}}\) can be formulated as follows:

$$\varvec{P}_{\text{IOR}} = N\left| {\varvec{1} - \sum\limits_{\tau = 0}^{T - 1} {\frac{T - \tau }{T}} \varvec{M}_{\tau } } \right|,$$
(8)

where \(N\left| \cdot \right|\) denotes clipping and normalization. As time passes, the influence of past fixations gradually fades: when the IoR map for the next fixation is computed, the current forgetting area \(\varvec{M}_{\tau }\) is removed from the map, and the forgetting areas removed earlier are linearly restored.
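The following sketch illustrates Eqs. (7) and (8); the inhibition radius \(\sigma_{\text{M}}\) here is an assumed value, not the paper's exact setting.

```python
# A minimal sketch of linear-forgetting IoR over the last T fixations.
import numpy as np

def ior_map(shape, past_fixations, sigma_m=25.0):
    """past_fixations: list of (row, col), most recent first; T = len(list)."""
    T = len(past_fixations)
    rows, cols = np.indices(shape)
    p = np.ones(shape)                      # the all-ones matrix of Eq. (8)
    for tau, (r, c) in enumerate(past_fixations):
        # Eq. (7): Gaussian forgetting area around the past fixation.
        m_tau = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma_m ** 2))
        # Eq. (8): more recent fixations are inhibited more strongly.
        p -= (T - tau) / T * m_tau
    p = np.clip(p, 0.0, None)               # the clipping in N|.|
    return p / p.sum()                      # ...and the normalization

P_IOR = ior_map((120, 160), [(60, 80), (30, 100), (50, 40)])
```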

Strategy for choosing the next fixation candidates

According to the above factors, three probability maps have been computed. Assuming that these maps are mutually independent, the transition probability can be computed as follows:

$$p(\varvec{x}) = p_{\text{FS}} (\varvec{x}) \times p_{\text{SB}} (\varvec{x}) \times p_{\text{IOR}} (\varvec{x}) .$$
(9)

There are two strategies for selecting the next fixation according to the probability map \(\varvec{P}\). One is choosing the point with the highest value in \(\varvec{P}\). The other is generating several candidate points by random sampling and taking the candidate with the highest saliency gain with respect to the previous fixation (Le Meur and Liu 2015). Owing to the randomness of the human visual system, the transition probability only represents a trend over repeated trials and cannot determine the next fixation in a single trial; with the former strategy, the calculated scanpaths tend to cycle among a few local maxima. Therefore, the latter strategy is chosen: based on the final probability map, 10 candidate points are randomly sampled, and the candidate with the highest value of \(p_{\text{FS}} (\varvec{x})\) is chosen as the next fixation.
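A minimal sketch of Eq. (9) and this candidate strategy might look as follows; the three input maps are assumed to be nonnegative arrays of equal shape, such as those produced by the earlier sketches.

```python
# A minimal sketch of Eq. (9) plus candidate sampling for the next fixation.
import numpy as np

def next_fixation(p_fs, p_sb, p_ior, n_candidates=10, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    p = p_fs * p_sb * p_ior                 # Eq. (9): maps assumed independent
    p = p.ravel() / p.sum()
    # Sampling candidates keeps the randomness of human gaze; always taking
    # the global maximum tends to produce periodic scanpaths.
    idx = rng.choice(p.size, size=n_candidates, replace=False, p=p)
    best = idx[np.argmax(p_fs.ravel()[idx])]  # highest p_FS among candidates
    return np.unravel_index(best, p_fs.shape)  # (row, col) of the next fixation
```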

Implementation

The implementation details and the parameter selection used in the experiments are described in this subsection. The steps of the proposed algorithm are listed in Table 1.

Table 1 Proposed algorithm

Figure 2 shows the framework of the proposed model together with the probability maps of each step. In the final probability map in Fig. 2, the current fixation (triangle), the candidates (squares), and the next fixation (circle) are marked. It can be seen that combining the foveated saliency with the other two factors is effective in choosing the next fixation.

Fig. 2

Framework of the proposed model based on foveated image saliency

A 4-layer pyramid is used to compute the foveated image; more layers are unnecessary for obtaining the saliency map and would increase the amount of computation. The KDE toolbox of Botev et al. (2010) is used in computing saccadic biases. Five previous fixations are taken into account in the IoR mechanism, because one fixation lasts about 300 ms and the IoR effect lasts about 1.5–3 s (Samuel and Kat 2003); since the real eye tracking sequences in the datasets contain about 3–12 points, using more than five previous fixations is unnecessary in our experiments. The IoR area parameter \(\sigma_{\text{M}}\) follows that of the STB model (Walther and Koch 2006). The number of candidates controls the randomness of the estimated results: as the candidate number increases, the results of the second strategy approach those of the first strategy, while as it approaches one, the randomness increases and the candidates may fail to include the salient points. Thus, we fix the candidate number at 10, which gives the best performance across the experiments. For every input image, one generated scanpath consists of 8 fixations, each represented by its location \((x,y)\).

Results

Similarity metrics

In our experiments, both the locations and the order of fixations are taken into account when choosing similarity metrics.

If we regard scanpaths as sets of points without considering the order, sAUC (shuffled area under the ROC curve) (Borji et al. 2013) can be used to measure the difference between estimated results and ground truth data. sAUC is an improvement over the uniform AUC (area under the ROC curve); the higher the sAUC score, the better the result.
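The sketch below illustrates the shuffled-AUC idea: saliency values at the true fixations serve as positives, while values at fixations borrowed from other images serve as negatives, which penalizes center bias. This is a common reading of the metric, not the exact implementation of Borji et al. (2013).

```python
# A minimal sketch of shuffled AUC via the rank-based (Mann-Whitney) AUC.
import numpy as np

def sauc(saliency, fixations, shuffled_fixations):
    """fixations / shuffled_fixations: (n, 2) integer (row, col) arrays."""
    pos = saliency[fixations[:, 0], fixations[:, 1]]
    neg = saliency[shuffled_fixations[:, 0], shuffled_fixations[:, 1]]
    # Probability that a positive outranks a negative, with ties counted half.
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq
```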

When the order of fixations is taken into account, one sequence is estimated for each image, while the ground truth eye tracking data consist of the scanpaths of several users viewing that image. Therefore, we evaluate the estimated sequence against the scanpath of each viewer and average all the evaluation scores to yield an overall score for the image.

Hausdorff distance and mean minimal distance (Wang et al. 2011) are metrics for measuring the similarity of two sets; they can also be used to compare sequences. The sequence \(X = (\varvec{x}_{1} ,\varvec{x}_{2} , \ldots ,\varvec{x}_{n} )\) (or \(Y = (\varvec{y}_{1} ,\varvec{y}_{2} , \ldots ,\varvec{y}_{m} )\)) is divided into pieces of length \(k\). \(C_{x}^{k} (t) = (\varvec{x}_{t} ,\varvec{x}_{t + 1} , \ldots ,\varvec{x}_{t + k - 1} )\) is defined as the \(k\)-dimensional vector starting from the \(t\) th fixation of sequence \(X\). The model space \(\{ C_{x}^{k} (t)\}_{t} \subseteq R^{k}\) is obtained by varying \(t\); the model space of sequence \(Y\) is obtained in the same way. In Eqs. (10) and (11), \(d_{\text{H}}^{k}\) denotes the Hausdorff distance and \(d_{\text{MM}}^{k}\) the mean minimal distance:

$$d_{\text{H}}^{k} = \mathop {\max }\limits_{t} \left\{ {\mathop {\min }\limits_{\tau } \left\{ {\left\| {C_{x}^{k} (t) - C_{y}^{k} (\tau )} \right\|} \right\}} \right\}/k,$$
(10)
$$d_{\text{MM}}^{k} = E_{t} \left\{ {\mathop {\min }\limits_{\tau } \left\{ {\left\| {C_{x}^{k} (t) - C_{y}^{k} (\tau )} \right\|} \right\}} \right\}.$$
(11)

The former computes the maximal value of all the minimal distances between the two sets, while the latter computes the mean value. A smaller distance indicates a better prediction.
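A compact Python sketch of Eqs. (10) and (11) is given below, treating each length-\(k\) window of fixations as a flattened vector; note that, following the equations as printed, only the Hausdorff distance is scaled by \(k\).

```python
# A minimal sketch of Eqs. (10)-(11) over length-k subsequences of scanpaths.
import numpy as np

def subsequences(path, k):
    """path: (n, 2) array of fixations -> (n-k+1, 2k) array of windows C^k."""
    return np.stack([path[t:t + k].ravel() for t in range(len(path) - k + 1)])

def scanpath_distances(X, Y, k=2):
    cx = subsequences(np.asarray(X, float), k)
    cy = subsequences(np.asarray(Y, float), k)
    # Pairwise Euclidean distances between all windows of X and all of Y.
    d = np.linalg.norm(cx[:, None, :] - cy[None, :, :], axis=2)
    d_min = d.min(axis=1)        # min over tau, for each t
    d_h = d_min.max() / k        # Eq. (10): Hausdorff distance, scaled by k
    d_mm = d_min.mean()          # Eq. (11): mean minimal distance
    return d_h, d_mm

# Example with two short scanpaths:
dh, dmm = scanpath_distances([(10, 20), (40, 50), (80, 30)],
                             [(12, 22), (42, 48), (70, 90)], k=2)
```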

The ScanMatch metric (Cristino et al. 2010) is recommended by Anderson et al. (2015), who gave an overview and comparison of existing metrics for measuring sequence similarity. In our experiments, we focus on the positions of fixations, the amplitude and direction of saccades, and the order of fixations; therefore, following the recommendation of Anderson et al. (2015), ScanMatch is chosen as the similarity metric for scanpaths. ScanMatch compares fixation sequences based on the Needleman–Wunsch algorithm (Needleman and Wunsch 1970), which uses dynamic programming to align and score sequences. The ScanMatch score is normalized to [0, 1] and is independent of sequence length; a perfect match between two sequences receives a score of 1. This metric has the advantage of being robust and objective.
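For illustration, the following sketch shows the core Needleman–Wunsch recursion. Real ScanMatch first quantizes fixations into grid-cell letters and uses a distance-based substitution matrix; the simple match/mismatch scores and gap penalty here are assumptions for illustration only.

```python
# A minimal sketch of Needleman-Wunsch global sequence alignment.
import numpy as np

def needleman_wunsch(a, b, match=1.0, mismatch=-1.0, gap=-0.5):
    """a, b: sequences of hashable symbols (e.g. grid-cell labels)."""
    n, m = len(a), len(b)
    score = np.zeros((n + 1, m + 1))
    score[:, 0] = gap * np.arange(n + 1)   # leading gaps in b
    score[0, :] = gap * np.arange(m + 1)   # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i, j] = max(score[i - 1, j - 1] + sub,  # align two symbols
                              score[i - 1, j] + gap,      # gap in b
                              score[i, j - 1] + gap)      # gap in a
    return score[n, m]

print(needleman_wunsch("ABCD", "ABD"))  # -> 2.5
```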

Each of the metrics introduced above focuses on a different aspect: sAUC measures how well the estimated fixations match the ground truth fixation density; Hausdorff distance and mean minimal distance compare two scanpaths from the perspective of subsequences; and ScanMatch gives an overall comparison of the two scanpaths. Therefore, the experimental results can be objectively evaluated and analyzed with these three kinds of metrics.

Datasets

We use the public eye tracking datasets of Bruce and Judd (Bruce and Tsotsos 2005; Judd et al. 2009) to evaluate the performance of our model. The Bruce dataset (Bruce and Tsotsos 2005) contains 120 natural images with eye tracking data as ground truth; the scanpaths of 20 users are recorded for each image, with sequence lengths ranging from 3 to 8. The Judd dataset (Judd et al. 2009) contains 1003 images of various types, including natural images, portraits, and psychological patterns; the scanpaths of 15 users are recorded for each image, with sequence lengths ranging from 6 to 12.

Analysis and discussions

Experiments are conducted on two datasets: the eye tracking datasets of Bruce and Tsotsos (2005) and Judd et al. (2009). Our model is compared with four state-of-the-art approaches: STB (Walther and Koch 2006), SHSSNI (Wang et al. 2011), DMSG (Engbert et al. 2015), and SMFC (Le Meur and Liu 2015), all of which were introduced in the Introduction. For every model, 10 scanpaths per image are generated for the evaluation, each consisting of 8 fixations.

Since DMSG calculates scanpaths based on existing saliency maps, we use STB saliency maps here. In order to validate the effect of the proposed framework, three variants of our method are studied. In Case A (CA), the PQFT saliency map in our framework is replaced by STB; this is designed to examine whether the proposed framework is more effective than the reference methods STB and DMSG when combined with the same saliency computation. In Case B (CB), the saliency map is computed without using foveated images. In Case C (CC), the saliency map is computed without the saccadic biases. CB and CC investigate the effects of the foveated image and the saccadic bias on the results, respectively.

Tables 2 and 3 show the scores of the four similarity metrics, averaged over the ten scanpaths. For the Hausdorff distance and the mean minimal distance, subsequences of different lengths \(k\) are used. As listed in Tables 2 and 3, in terms of every similarity metric, the proposed model outperforms the other approaches, especially on the Bruce dataset. Given that the Judd dataset contains not only natural images but also portraits and psychological patterns, its eye tracking data may be influenced by high-level factors, which may explain why the proposed model does not perform as well on one or two metrics.

Table 2 Scores of 4 similarity metrics for 8 methods in the dataset of Bruce
Table 3 Scores of 4 similarity metrics for 8 methods in the dataset of Judd

The SHSSNI model obtains relatively good results among the competitors, but it relies on ICA filter responses and runs quite slowly. The DMSG model is relatively simple but depends heavily on the saliency map used, which may lead to poor performance. SMFC is remarkable work that exploits the saccadic biases of the datasets, but it does not sufficiently consider dynamic saliency maps; we made improvements to address this shortcoming. Although the frameworks of SMFC and our model seem similar, our work focuses on biologically plausible foveated images, which is the most important difference, and the results show that our model works better. Our model takes the saliency change caused by gaze shifts into consideration, which is essential for obtaining good results; the effect of the biologically plausible foveated image is also demonstrated in the CB experiment. Furthermore, the saliency map computed by GBVS is not as accurate as that computed by PQFT, which was proposed by our research group (Guo et al. 2008).

It is known that the saliency map computed by the PQFT method works better than that of the STB model. The comparisons of CA against STB and DMSG demonstrate the advantage of our framework: even when using the same STB saliency computation, our method still outperforms the others.

The results of CB and CC show the effects of the two important components, the foveated image and the saccadic bias. The sAUC scores of CB and the proposed model are quite close; however, judging from the Hausdorff distance, mean minimal distance, and ScanMatch scores, the proposed model with foveated images works better than the variant without them. The results of CC are quite close to those of the proposed model: compared with the foveated image, the saccadic bias contributes only a small improvement to the overall model. It can be seen that the saliency computed on foveated images plays an important role in the whole model.

Examples of the estimated results are shown in Fig. 3a–d, and the corresponding real eye tracking data are shown in Fig. 3e–h. The results in Fig. 3a–c are close to the ground truth scanpaths, and most of the estimated results perform this well. However, there are exceptional cases such as Fig. 3d, where several salient regions are scattered and the distances between them are relatively large. Owing to the tendency toward short-distance gaze shifts, candidates far from the current fixation are rarely generated, so the estimated scanpaths may fail to cross the scattered salient regions. Solving this problem will be our future work.

Fig. 3

Comparison between the results of the proposed method and the ground truth, a–d estimated scanpath results, e–h real eye tracking data corresponding to (a)–(d)

In summary, our model yields higher accuracy than the existing methods compared on the two datasets under objective measures. We have also evaluated the contributions of the whole framework as well as of each of its parts. The results show that using dynamic saliency maps is critical for scanpath estimation.

Conclusions

In this paper, a bio-inspired fixation and scanpath estimation method has been proposed to address the inaccuracy of existing models and their neglect of the saliency changes caused by gaze shifts. The proposed model takes into consideration that saliency maps change with fixation locations: probability maps based on foveated image saliency are used, with the saccadic biases of gaze shifts and the IoR mechanism as two additional factors. The most appropriate candidate, chosen from a series of candidates generated through random sampling of the integrated probability map, becomes the next fixation. Experimental results have shown that the estimated scanpaths accord with the distribution of real data in most cases; under the evaluation of several comprehensive measures, the proposed method shows advantages over existing models on the two well-known datasets.