ABSTRACT
Underwater acoustic target recognition (UATR) based on deep learning suffers from low recognition accuracy on larger datasets. This paper proposes the UATR-MSG-Transformer (a Transformer with messenger tokens for UATR). The Mel-filter bank (Mel-fbank) and LOFAR spectrogram features of each target's noise are extracted and concatenated along the channel dimension as the input, and a Squeeze-and-Excitation (SE) block learns and adjusts the weight of each feature in the channel dimension. The features are then projected into tokens and split into local windows, and a messenger (MSG) token is introduced into each local window to summarize the information within that window and exchange it with the other windows. Experimental results show that the UATR-MSG-Transformer effectively improves recognition accuracy.
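The channel-dimension fusion and SE reweighting described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the feature shapes, the reduction ratio, and the weight matrices `w1`/`w2` are all illustrative assumptions, and in the actual model the SE weights would be learned during training.

```python
import numpy as np

def se_reweight(x, w1, w2):
    """Squeeze-and-Excitation over the channel dimension of x: (C, H, W)."""
    s = x.mean(axis=(1, 2))               # squeeze: global average pool -> (C,)
    z = np.maximum(w1 @ s, 0.0)           # excitation: FC + ReLU -> (C/r,)
    a = 1.0 / (1.0 + np.exp(-(w2 @ z)))   # FC + sigmoid -> per-channel weights in (0, 1)
    return x * a[:, None, None]           # scale each channel by its learned weight

# Toy example: one channel per spectrogram type, so C = 2 after fusion.
rng = np.random.default_rng(0)
mel = rng.standard_normal((1, 8, 8))      # stand-in for a Mel-fbank spectrogram
lofar = rng.standard_normal((1, 8, 8))    # stand-in for a LOFAR spectrogram
fused = np.concatenate([mel, lofar], axis=0)   # channel-dim fusion -> (2, 8, 8)

w1 = rng.standard_normal((1, 2))          # illustrative excitation weights (reduction r = 2)
w2 = rng.standard_normal((2, 1))
out = se_reweight(fused, w1, w2)
print(out.shape)  # (2, 8, 8)
```

Because the sigmoid gate lies in (0, 1), each channel of the fused spectrogram is attenuated in proportion to its learned importance before the result is projected into tokens for the Transformer stage.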
Index Terms
- UATR-MSG-Transformer: A Deep Learning Network for Underwater Acoustic Target Recognition Based on Spectrogram Feature Fusion and Transformer with Messenger Tokens