A soft-computing-based approach to artificial visual attention using human eye-fixation paradigm: toward a human-like skill in robot vision

  • Methodologies and Application

Abstract

Matching the skills of natural vision is an appealing prospect for artificial vision systems, especially in robotics applications dealing with visual perception of the complex surrounding environment in which robots and humans evolve and/or cooperate with one another, and more generally in those targeting human–robot interaction. Approaching the visual attention problem through the human eye-fixation paradigm, we propose in this work a model for artificial visual attention that combines a statistical foundation of visual saliency with a genetic tuning of the related parameters for robots' visual perception. Computationally, the model relies on the one hand on center-surround statistical features combined through a nonlinear fusion of the resulting maps, and on the other hand on an evolutionary tuning toward the human way of gazing, from which a kind of artificial eye-fixation-based visual attention emerges. The statistical foundation and bottom-up nature of the proposed model also allow it to be used without prior information, while providing a comprehensive and solid theoretical basis. The eye-fixation paradigm has been taken as the keystone of the human-like gazing attribute, molding the robot's visual behavior toward the human one. The same paradigm, through the MIT1003 and TORONTO image datasets, has served as the evaluation benchmark for the experimental validation of the proposed system. The reported experimental results show the viability of the incorporated genetic tuning process in steering the behavior of the artificial system toward the human-like gazing mechanism. In addition, a comparison with the currently best algorithms in the field is reported on the MIT300 dataset. Although not designed for the eye-fixation prediction task, the proposed system remains comparable to most algorithms within this leading group of state-of-the-art methods. Moreover, running about ten times faster than the currently best state-of-the-art algorithms, the proposed approach offers a promising execution speed that makes it suitable for an effective implementation meeting the requirements of real-time robotics applications.


References

  • Achanta R, Estrada F, Wils P, Susstrunk S (2008) Salient region detection and segmentation. In: Proceedings of international conference on computer vision systems, vol 5008, LNCS, Springer, Berlin/Heidelberg, pp 66–75

  • Achanta R, Hemami S, Estrada F, Susstrunk S (2009) Frequency-tuned salient region detection. In: Proceedings of IEEE international conference on computer vision and pattern recognition, pp 1597–1604

  • Borji A, Tavakoli HR, Sihite DN, Itti L (2013a) Analysis of scores, datasets, and models in visual saliency prediction. In: Proceedings of IEEE ICCV, pp 921–928

  • Borji A, Itti L (2013) State-of-the-art in visual attention modeling. IEEE Trans Pattern Anal Mach Intell 35(1):185–207

  • Borji A, Sihite DN, Itti L (2013) Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study. IEEE Trans Image Process 22(1):55–69

  • Bruce NDB, Tsotsos JK (2009) Saliency, attention, and visual search: an information theoretic approach. J Vis 9(3):1–24

  • Chella A, Macaluso I (2009) The perception loop in CiceRobot, a museum guide robot. Neurocomputing 72(4–6):760–766

  • Contreras-Reyes JE, Arellano-Valle RB (2012) Kullback-Leibler divergence measure for multivariate skew-normal distributions. Entropy 14(9):1606–1626

  • Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27:861–874

  • Cornia M, Baraldi L, Serra G, Cucchiara R (2016a) Predicting human eye fixations via an LSTM-based saliency attentive model. CoRR. arXiv:1611.09571

  • Cornia M, Baraldi L, Serra G, Cucchiara R (2016b) A deep multi-level network for saliency prediction. In: Proceedings of international conference on pattern recognition (ICPR)

  • Harel J, Koch C, Perona P (2007) Graph-based visual saliency. Adv Neural Inf Process Syst 19:545–552

  • Hayhoe M, Ballard D (2005) Eye movements in natural behavior. Trends Cogn Sci 9:188–194

  • Holzbach A, Cheng G (2014) A scalable and efficient method for salient region detection using sampled template collation. In: Proceedings of IEEE ICIP, pp 1110–1114

  • Itti L, Koch C, Niebur E (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Trans Pattern Anal Mach Intell 20:1254–1259

  • Jiang M, Xu J, Zhao Q (2014) Saliency in crowd. In: Proceedings of ECCV, Lecture Notes in Computer Science, vol 8695, pp 17–32

  • Judd T, Ehinger K, Durand F, Torralba A (2009) Learning to predict where humans look. In: Proceedings of IEEE ICCV, pp 2106–2113

  • Judd T, Durand F, Torralba A (2012) A benchmark of computational models of saliency to predict human fixations. MIT Technical Report, http://saliency.mit.edu/

  • Kachurka V, Madani K, Sabourin C, Golovko V (2014) A statistical approach to human-like visual attention and saliency detection for robot vision: application to wildland fires detection. In: Proceedings of ICNNAI 2014, Brest, Byelorussia, June 3–6, CCIS series, vol 440. Springer, pp 124–135

  • Kachurka V, Madani K, Sabourin C, Golovko V (2015) From human eye fixation to human-like autonomous artificial vision. In: Proceedings of the international work-conference on artificial neural networks (IWANN 2015), LNCS series, vol 9094, Part I. Springer, pp 171–184

  • Kadir T, Brady M (2001) Saliency, scale and image description. Int J Comput Vis 45(2):83–105

  • Kienzle W, Franz MO, Schölkopf B, Wichmann FA (2009) Center-surround patterns emerge as optimal predictors for human saccade targets. J Vis 9:1–15

  • Koehler K, Guo F, Zhang S, Eckstein MP (2014) What do saliency models predict? J Vis 14(3):1–27

  • Kümmerer M, Wallis TS, Bethge M (2016) DeepGaze II: reading fixations from deep features trained on object recognition. arXiv:1610.01563

  • Kümmerer M, Theis L, Bethge M (2015) DeepGaze I: boosting saliency prediction with feature maps trained on ImageNet. In: Proceedings of international conference on learning representations (ICLR)

  • Liang Z, Chi Z, Fu H, Feng D (2012) Salient object detection using content-sensitive hypergraph representation and partitioning. Pattern Rec 45(11):3886–3901

  • Liu T, Yuan Z, Sun J, Wang J, Zheng N, Tang X, Shum H-Y (2011) Learning to detect a salient object. IEEE Trans Pattern Anal Mach Intell 33(2):353–367

  • Liu T, Sun J, Zheng NN, Shum HY (2007) Learning to detect a salient object. In: Proceedings of IEEE ICCV, pp 1–8

  • Moreno R, Ramik DM, Graña M, Madani K (2012) Image segmentation on the spherical coordinate representation of the RGB color space. IET Image Proc 6(9):1275–1283

  • Navalpakkam V, Itti L (2006) An integrated model of top-down and bottom-up attention for optimizing detection speed. In: Proceedings of IEEE CVPR, vol II, pp 2049–2056

  • Pan J, Canton C, McGuinness K, O’Connor NE, Torres J (2017) SalGAN: visual saliency prediction with generative adversarial networks. In: Proceedings of scene understanding workshop (SUNw), CVPR 2017, July 21 to July 26, 2017, Honolulu, Hawaii, USA

  • Panerai F, Metta G, Sandini G (2002) Learning visual stabilization reflexes in robots with moving eyes. Neurocomputing 48(1–4):323–337

  • Peters RJ, Iyer A, Itti L, Koch C (2005) Components of bottom-up gaze allocation in natural images. Vis Res 45(18):2397–2416

  • Rajashekar U, Vander Linde I, Bovik AC, Cormack LK (2008) GAFFE: a gaze-attentive fixation finding engine. IEEE Trans Image Process 17(4):564–573

  • Ramik DM (2012) Contribution to complex visual information processing and autonomous knowledge extraction: application to autonomous robotics. Ph.D. dissertation, University Paris-Est, Pub. No. 2012PEST1100

  • Ramik DM, Sabourin C, Madani K (2011) Hybrid salient object extraction approach with automatic estimation of visual attention scale. In: Proceedings of IEEE SITIS, pp 438–445

  • Ramik DM, Sabourin C, Moreno R, Madani K (2014) A machine learning based intelligent vision system for autonomous object detection and recognition. J Appl Intell 40(2):358–375

  • Ramik DM, Madani K, Sabourin C (2015) A soft-computing basis for robots’ cognitive autonomous learning. Soft Comput J 19:2407–2421

  • Riche N, Duvinage M, Mancas M, Gosselin B, Dutoit T (2013) Saliency and human fixations: state-of-the-art and study of comparison metrics. In: Proceedings of IEEE ICCV, pp 1153–1160

  • Riche N, Mancas M, Duvinage M, Mibulumukini M, Gosselin B, Dutoit T (2013) RARE2012: A multi-scale rarity-based saliency detection with its comparative statistical analysis. Signal Process Image Commun J 28(6):642–658

  • Shen Ch, Zhao Q (2014) Webpage saliency. In: Proceedings of ECCV, Lecture Notes in Computer Science, vol 8695, pp 34–46

  • Subramanian R, Katti H, Sebe N, Kankanhalli M, Chua T-S (2010) An eye fixation database for saliency detection in images. In: Proceedings of ECCV, pp 30–43

  • Tatler BW (2007) The central fixation bias in scene viewing: selecting an optimal viewing position independently of motor biases and image feature distributions. J Vis 7(14):1–17

  • Tavakoli HR, Laaksonen J (2016) Bottom-up fixation prediction using unsupervised hierarchical models. In: Proceedings of ACCV 2016, Workshop on Assistive Vision, LNCS 10116. Springer, pp 287–302

  • Triesch J, Ballard DH, Hayhoe MM, Sullivan BT (2003) What you see is what you need. J Vis 3:86–94

  • Vig E, Dorr M, Cox D (2014) Large-scale optimization of hierarchical features for saliency prediction in natural images. In: Proceedings of IEEE CVPR, pp 2798–2805

  • Võ ML-H, Smith TJ, Mital PK, Henderson JM (2012) Do the eyes really have it? Dynamic allocation of attention when viewing moving faces. J Vis 12(13):1–14

  • Zhang J, Sclaroff S (2013) Saliency detection: a Boolean map approach. In: Proceedings of IEEE ICCV, pp 153–160

Author information


Corresponding author

Correspondence to Kurosh Madani.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical standard

The authors declare that the presented work does not involve any direct human participants. The authors also declare that all standard benchmark databases used were in accordance with the ethical standards of the institutional and national research committee and with the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards.

Additional information

Communicated by V. Loia.

Appendices

Appendix A-1

Let us suppose the image \(I_{YCC} \), represented by its pixels \(I_{YCC} \left( x \right) \), in the YCrCb color space, where \(x\in \mathrm{N}^{2}\) denotes the 2D pixel position. Let \(I_Y \left( x \right) \), \(I_{Cr} \left( x \right) \) and \(I_{Cb} \left( x \right) \) be the color values in channels Y, Cr and Cb, respectively. Similarly, let \(I_{RGB} \left( x \right) \) be the same image in the RGB color space and \(I_R \left( x \right) \), \(I_G \left( x \right) \) and \(I_B \left( x \right) \) be its color values in channels R, G and B, respectively. Finally, let \(\overline{I_Y } \), \(\overline{I_{Cr} } \) and \(\overline{I_{Cb} } \) be the median values of each channel over the whole image.

As a reminder, we recall that the GSM results from a nonlinear fusion of two elementary maps, denoted \(M_Y \left( x \right) \) and \(M_{CrCb} \left( x \right) \), relating to luminance and chromaticity, respectively (Moreno et al. 2012; Ramik 2012). Equations (A-1), (A-2) and (A-3) detail the calculation of each elementary map as well as of the resulting GSM.

$$\begin{aligned}&M_Y \left( x \right) =\left\| {\overline{I_Y } -I_Y \left( x \right) } \right\| \end{aligned}$$
(A-1)
$$\begin{aligned}&M_{CrCb} \left( x \right) =\sqrt{\left( {\overline{I_{Cr} } -I_{Cr} \left( x \right) } \right) ^{2}+\left( {\overline{I_{Cb} } -I_{Cb} \left( x \right) } \right) ^{2}} \end{aligned}$$
(A-2)
$$\begin{aligned}&M_G \left( x \right) =\frac{1}{1-e^{-C\left( x \right) }} M_{CrCb} \left( x \right) \nonumber \\&\quad +\left( {1-\frac{1}{1-e^{-C\left( x \right) }}} \right) M_Y \left( x \right) \end{aligned}$$
(A-3)

\(C\left( x \right) \) is a coefficient computed according to Eq. (A-4); it depends on the saturation of each pixel in the RGB color space, expressed by Eq. (A-5).

$$\begin{aligned} C\left( x \right)= & {} 10-0.5 C_C \left( x \right) \end{aligned}$$
(A-4)
$$\begin{aligned} C_C \left( x \right)= & {} Max \left( {I_R \left( x \right) , I_G \left( x \right) , I_B \left( x \right) } \right) \nonumber \\&-Min \left( {I_R \left( x \right) , I_G \left( x \right) , I_B \left( x \right) } \right) \end{aligned}$$
(A-5)
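
For illustration only, the following Python/NumPy sketch computes the GSM along the lines of Eqs. (A-1)–(A-5). It is not the authors' implementation: the function name, the OpenCV color conversion and the logistic form \(1/(1+e^{-C(x)})\) of the fusion weight, used here instead of the printed \(1/(1-e^{-C(x)})\) to avoid its singularity at \(C(x)=0\), are assumptions made for this sketch.

```python
# Minimal sketch of the GSM of Eqs. (A-1)-(A-5). Not the authors' code:
# the OpenCV conversion and the logistic fusion weight are our assumptions.
import numpy as np
import cv2  # used only for the RGB -> YCrCb conversion

def global_saliency_map(img_rgb: np.ndarray) -> np.ndarray:
    """img_rgb: H x W x 3 uint8 image in RGB order."""
    ycc = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2YCrCb).astype(np.float64)
    Y, Cr, Cb = ycc[..., 0], ycc[..., 1], ycc[..., 2]

    # Elementary maps, Eqs. (A-1) and (A-2): distance to the per-channel medians.
    M_Y = np.abs(np.median(Y) - Y)
    M_CrCb = np.sqrt((np.median(Cr) - Cr) ** 2 + (np.median(Cb) - Cb) ** 2)

    # Saturation-based coefficient, Eqs. (A-4) and (A-5).
    rgb = img_rgb.astype(np.float64)
    C = 10.0 - 0.5 * (rgb.max(axis=2) - rgb.min(axis=2))

    # Nonlinear fusion of the two maps, Eq. (A-3), with the assumed logistic weight.
    w = 1.0 / (1.0 + np.exp(-C))
    return w * M_CrCb + (1.0 - w) * M_Y
```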

In the same way, based on the center-surround histogram concept involving the statistical properties of two centered windows (over each pixel) sliding across the whole image (initially proposed in Liu et al. 2007), the LSM is also obtained from a nonlinear fusion of two statistical-property-based maps relating to the luminance and chromaticity of the image (Moreno et al. 2012; Ramik 2012). Referring to Fig. 12, let \(P\left( x \right) \) be a sliding window of size P, centered over the pixel located at position x, and let \(Q\left( x \right) \) be another surrounding area of size Q around the same pixel, with \(Q-P=p^{2}\). Let us consider the “center histogram” \(H_C \left( x \right) \) as the histogram of pixel intensities in the window \(P\left( x \right) \), with \(h_C \left( {x , k} \right) \) representing the value of the kth bin of this histogram. Likewise, let us consider the “surround histogram” \(H_S \left( x \right) \) as the histogram of pixel intensities in the surrounding window \(Q\left( x \right) \), with \(h_S \left( {x , k} \right) \) representing the value of its kth bin. Finally, let us define the “center-surround feature” \(d_{ch} \left( x \right) \) as the sum of differences between the normalized center and surround histograms over all 256 histogram bins, computed for each channel “ch” (\(ch\in \left\{ {Y , Cr , Cb} \right\} \)).

Fig. 12 Idea of centered windows sliding across an image of size n × m

Equation (A-6) details the computation of the so-called “center-surround feature,” where \(\left| {H_C \left( x \right) } \right| \) and \(\left| {H_S \left( x \right) } \right| \) represent the sums over all histogram bins. For pixels near the image borders, the areas \(P\left( x \right) \) and \(Q\left( x \right) \) are clipped to the image itself (i.e., they do not exceed the image borders).

$$\begin{aligned} d_{ch} \left( x \right) =\sum _{k=1}^{256} {\left| {\frac{h_C \left( {x , k} \right) }{\left| {H_C \left( x \right) } \right| }-\frac{h_S \left( {x , k} \right) }{\left| {H_S \left( x \right) } \right| }} \right| } \end{aligned}$$
(A-6)
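
As an illustration of Eq. (A-6), the sketch below computes the center-surround feature for a single channel with an unoptimized sliding-window loop. The window sizes p and q, and the use of the full outer window \(Q\left( x \right) \) for the surround histogram, are illustrative assumptions rather than values taken from the paper.

```python
# Brute-force sketch of the center-surround feature of Eq. (A-6) for one channel.
# Window sizes are illustrative; the loop is unoptimized on purpose for clarity.
import numpy as np

def center_surround_feature(channel: np.ndarray, p: int = 15, q: int = 45) -> np.ndarray:
    """channel: H x W uint8 single channel (Y, Cr or Cb); p < q are window sizes."""
    H, W = channel.shape
    d = np.zeros((H, W), dtype=np.float64)
    rp, rq = p // 2, q // 2
    for y in range(H):
        for x in range(W):
            # Windows are clipped at the image borders, as stated after Eq. (A-6).
            cw = channel[max(0, y - rp):y + rp + 1, max(0, x - rp):x + rp + 1]
            sw = channel[max(0, y - rq):y + rq + 1, max(0, x - rq):x + rq + 1]
            h_c = np.bincount(cw.ravel(), minlength=256).astype(np.float64)
            h_s = np.bincount(sw.ravel(), minlength=256).astype(np.float64)
            # Normalized histogram difference summed over the 256 bins, Eq. (A-6).
            d[y, x] = np.abs(h_c / h_c.sum() - h_s / h_s.sum()).sum()
    return d
```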

The LSM, resulting from a nonlinear fusion of the so-called center-surround features, is obtained according to Eq. (A-7), where the coefficient \(C_\mu \left( x \right) \) represents the average color saturation over the window \(P\left( x \right) \), computed as the average of the coefficients \(C\left( x \right) \) (see Eqs. (A-4) and (A-5)) using Eq. (A-8).

$$\begin{aligned}&M_L \left( x \right) =\frac{1}{1-e^{-C_\mu \left( x \right) }} d_Y \left( x \right) \nonumber \\&\qquad \qquad \quad +\left( {1-\frac{1}{1-e^{-C_\mu \left( x \right) }}} \right) Max \left( { d_{Cr} \left( x \right) , d_{Cb} \left( x \right) } \right) \nonumber \\\end{aligned}$$
(A-7)
$$\begin{aligned}&C_\mu \left( x \right) =\frac{\sum _{l=1}^P {C\left( l \right) } }{P} \end{aligned}$$
(A-8)
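
Under the same assumptions as in the previous two sketches (illustrative window sizes and a logistic fusion weight in place of the printed \(1/(1-e^{-C_\mu(x)})\)), the LSM of Eqs. (A-7) and (A-8) can then be assembled as follows, reusing the center_surround_feature() function defined above.

```python
# Sketch of the LSM of Eqs. (A-7) and (A-8), reusing center_surround_feature().
import numpy as np
import cv2
from scipy.ndimage import uniform_filter  # local mean over the P-window

def local_saliency_map(img_rgb: np.ndarray, p: int = 15, q: int = 45) -> np.ndarray:
    """img_rgb: H x W x 3 uint8 image in RGB order."""
    ycc = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2YCrCb)
    d_Y = center_surround_feature(ycc[..., 0], p, q)
    d_Cr = center_surround_feature(ycc[..., 1], p, q)
    d_Cb = center_surround_feature(ycc[..., 2], p, q)

    # C(x) from Eqs. (A-4)/(A-5), then its mean over the P-window as in Eq. (A-8).
    rgb = img_rgb.astype(np.float64)
    C = 10.0 - 0.5 * (rgb.max(axis=2) - rgb.min(axis=2))
    C_mu = uniform_filter(C, size=p)

    # Nonlinear fusion of Eq. (A-7), with the assumed logistic weight (see GSM sketch).
    w = 1.0 / (1.0 + np.exp(-C_mu))
    return w * d_Y + (1.0 - w) * np.maximum(d_Cr, d_Cb)
```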

Appendix A-2

Table 9 summarizes the notations and symbols used in this article.

Table 9 Notations and symbols used in the article


About this article

Cite this article

Madani, K., Kachurka, V., Sabourin, C. et al. A soft-computing-based approach to artificial visual attention using human eye-fixation paradigm: toward a human-like skill in robot vision. Soft Comput 23, 2369–2389 (2019). https://doi.org/10.1007/s00500-017-2931-x

