
A model of co-saliency based audio attention

Multimedia Tools and Applications

Abstract

Inspired by the perceptual characteristics of the human auditory system and by the mechanisms of saliency detection, we study the relevance constraint between the time-frequency characteristics of sound signals and their multiple spectrograms, and propose a co-saliency detection method for multiple sound signals. According to the auditory characteristics of the human ear, distinctive saliency features from the acoustic channel and the image (spectrogram) channel are fused, and an auditory saliency map is obtained to detect significant sounds. The acoustic-channel saliency features are computed in the temporal and spectral domains of the signal: the temporal saliency features are represented by the local maxima of the Power Spectral Density (PSD) curve, and the spectral features by the local maxima of the Mel Frequency Cepstrum Coefficient (MFCC) curve of the sound signal. These acoustic-channel features are then fused across scales with the contrast cue of the spectrogram, yielding a result that better matches the human auditory attention mechanism. Finally, a corresponding cue that captures the distribution across multiple spectrograms is incorporated, reflecting global repeatability and a high frequency of occurrence. Experiments on the resulting auditory co-saliency map verify the accuracy and robustness of the proposed method, show that it outperforms traditional auditory saliency detection methods, and demonstrate that it can perform intelligent automatic detection of sound signals.
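To make the acoustic-channel cues described above more concrete, the following is a minimal, illustrative sketch, not the authors' implementation: it estimates a PSD curve and an MFCC curve for a sound signal and marks their local maxima as candidate saliency points. The function name, the choice of librosa and SciPy, and all parameter values are assumptions made for illustration only.

```python
# Illustrative sketch only: local-maximum cues from the PSD and MFCC curves,
# as described in the abstract. Library choices and parameters are assumptions.
import librosa
from scipy.signal import welch, find_peaks

def acoustic_saliency_cues(path, sr=16000, n_mfcc=13):
    # Load the sound signal at a fixed sampling rate.
    y, sr = librosa.load(path, sr=sr)

    # Temporal cue (per the paper's description): local maxima of the
    # Power Spectral Density (PSD) curve estimated with Welch's method.
    freqs, psd = welch(y, fs=sr, nperseg=1024)
    psd_peaks, _ = find_peaks(psd)

    # Spectral cue: local maxima of the MFCC curve (here, the mean MFCC
    # value per frame, giving one curve over time).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    mfcc_curve = mfcc.mean(axis=0)
    mfcc_peaks, _ = find_peaks(mfcc_curve)

    # Return peak locations and values for both cues; in the paper these
    # cues are subsequently fused with spectrogram-based saliency cues.
    return freqs[psd_peaks], psd[psd_peaks], mfcc_peaks, mfcc_curve[mfcc_peaks]
```

In the proposed model these acoustic-channel peaks are only the first stage; they are cross-scale fused with the spectrogram contrast cue and the inter-spectrogram corresponding cue to form the final auditory co-saliency map.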



Author information

Corresponding author

Correspondence to Xinxin Wang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhao, X., Wang, X. & Cheng, D. A model of co-saliency based audio attention. Multimed Tools Appl 79, 23045–23069 (2020). https://doi.org/10.1007/s11042-020-09020-3

