DOI: 10.1145/3640824.3640825
Research Article

Self-Supervised Learning Representations for Dialect Identification with Sparse Transformers

Published: 08 March 2024

Abstract

Self-supervised learning representations of speech have been successfully applied to language and dialect identification tasks. The usual approach adopts a self-supervised pre-trained model as the feature extractor and adds a classifier that is fine-tuned on the target datasets. However, fine-tuning a pre-trained model incurs heavy time and storage costs. In this work, we attempt to build a competent dialect identification system using frozen pre-trained self-supervised speech representations instead of fine-tuned input features. First, we fine-tune and compare several pre-trained self-supervised models to select the best framework as the speech representation extractor. Then, we introduce Transformers with a sparse self-attention mechanism to model both the global and the local information in those representations. Finally, to further boost performance, we explore several settings of the proposed method. We conduct experiments on the KeSpeech dataset. Experimental results show that our method significantly outperforms the best known results, with an average cost performance (Cavg) of 0.0911, an equal error rate (EER) of 8.53%, and an accuracy (Acc) of 80.38%. In addition, the proposed method obtains consistent relative improvements over its fine-tuned counterpart while requiring less training time and computation.
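The paper's implementation is not reproduced on this page, but the pipeline the abstract describes can be sketched as follows: a frozen pre-trained self-supervised model supplies frame-level speech representations, and a lightweight Transformer whose self-attention is restricted to a local band (one simple form of sparse attention) classifies the dialect. This is a minimal sketch, not the authors' code: the extractor choice ("facebook/wav2vec2-base"), the banded mask, the statistics-pooling head, the window size, and the label-set size are all illustrative assumptions.

```python
# Hedged sketch of the abstract's pipeline: frozen SSL extractor +
# sparse-attention Transformer classifier. All hyperparameters are guesses.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # any SSL extractor could be swapped in


def banded_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True blocks attention outside a local window."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() > window


class SparseTransformerDialectID(nn.Module):
    def __init__(self, feat_dim=768, n_heads=8, n_layers=2,
                 window=16, n_dialects=9):  # 9: Mandarin + 8 KeSpeech subdialects (assumed)
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.window = window
        self.classifier = nn.Linear(2 * feat_dim, n_dialects)

    def forward(self, feats):                       # feats: (B, T, feat_dim)
        mask = banded_attention_mask(feats.size(1), self.window).to(feats.device)
        hidden = self.encoder(feats, mask=mask)     # locally restricted attention
        # Statistics pooling: concatenate mean and std over the time axis.
        pooled = torch.cat([hidden.mean(dim=1), hidden.std(dim=1)], dim=-1)
        return self.classifier(pooled)


# Frozen extractor: no gradients flow into the pre-trained model.
# (In practice the waveform should be normalized with the matching
# Wav2Vec2FeatureExtractor; a raw tensor is used here for brevity.)
extractor = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
extractor.eval()
classifier = SparseTransformerDialectID()

wav = torch.randn(1, 16000)                         # 1 s of 16 kHz audio
with torch.no_grad():
    feats = extractor(wav).last_hidden_state        # (1, T, 768)
logits = classifier(feats)                          # (1, n_dialects)
```

Because the extractor runs under torch.no_grad() and its weights are never updated, only the small classifier is trained, which is the source of the time and storage savings the abstract claims over full fine-tuning.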


Cited By

  • (2025) Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification. IEEE Access, vol. 13, pp. 3115–3129. DOI: 10.1109/ACCESS.2024.3523951
  • (2024) A Multi-Task Approach with Multi-Grained Information Extraction for Dialect Speech Recognition. Proceedings of the 2024 4th International Conference on Artificial Intelligence, Automation and Algorithms, pp. 51–56. DOI: 10.1145/3700523.3700534. Online publication date: 27-Sep-2024

Published In

CCEAI '24: Proceedings of the 2024 8th International Conference on Control Engineering and Artificial Intelligence
January 2024, 297 pages
ISBN: 9798400707971
DOI: 10.1145/3640824

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. self-supervised learning
  2. sparse Transformers
  3. speech representations

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Zhejiang Electric Power Co., Ltd.

Conference

CCEAI 2024

Article Metrics

  • Downloads (last 12 months): 59
  • Downloads (last 6 weeks): 15

Reflects downloads up to 02 Mar 2025.
