ABSTRACT
Underwater acoustic target recognition (UATR) based on deep learning suffers from low recognition accuracy on larger datasets. This paper proposes the UATR-MSG-Transformer (a Transformer with messenger tokens for UATR). The Mel-filter bank (Mel-fbank) and LOFAR spectrogram features of each target's noise are extracted and concatenated along the channel dimension as the input, and a Squeeze-and-Excitation (SE) block learns and adjusts the weight of each feature in the channel dimension. The features are then projected into tokens and split into local windows, and a messenger (MSG) token is introduced into each local window to summarize the information within that window and exchange it with the other windows. Experimental results show that the UATR-MSG-Transformer effectively improves recognition accuracy.
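The channel-dimension fusion and SE reweighting described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the feature shapes, the reduction ratio, and the weight matrices `w1`/`w2` are all illustrative assumptions, and in the actual model the SE weights would be learned during training.

```python
import numpy as np

def se_reweight(x, w1, w2):
    """Squeeze-and-Excitation over the channel dimension of x: (C, H, W)."""
    s = x.mean(axis=(1, 2))               # squeeze: global average pool -> (C,)
    z = np.maximum(w1 @ s, 0.0)           # excitation: FC + ReLU -> (C/r,)
    a = 1.0 / (1.0 + np.exp(-(w2 @ z)))   # FC + sigmoid -> per-channel weights in (0, 1)
    return x * a[:, None, None]           # scale each channel by its learned weight

# Toy example: one channel per spectrogram type, so C = 2 after fusion.
rng = np.random.default_rng(0)
mel = rng.standard_normal((1, 8, 8))      # stand-in for a Mel-fbank spectrogram
lofar = rng.standard_normal((1, 8, 8))    # stand-in for a LOFAR spectrogram
fused = np.concatenate([mel, lofar], axis=0)   # channel-dim fusion -> (2, 8, 8)

w1 = rng.standard_normal((1, 2))          # illustrative excitation weights (reduction r = 2)
w2 = rng.standard_normal((2, 1))
out = se_reweight(fused, w1, w2)
print(out.shape)  # (2, 8, 8)
```

Because the sigmoid gate lies in (0, 1), each channel of the fused spectrogram is attenuated in proportion to its learned importance before the result is projected into tokens for the Transformer stage.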
Index Terms
- UATR-MSG-Transformer: A Deep Learning Network for Underwater Acoustic Target Recognition Based on Spectrogram Feature Fusion and Transformer with Messenger Tokens