DOI: 10.1145/3640824.3640825
Research Article

Self-Supervised Learning Representations for Dialect Identification with Sparse Transformers

Published: 08 March 2024

Abstract

Self-supervised learning representations of speech have been successfully applied to language and dialect identification tasks. The usual approach adopts a self-supervised pre-trained model as the feature extractor and adds a classifier that is fine-tuned on the target datasets. However, fine-tuning a pre-trained model incurs heavy time and storage costs. In this work, we attempt to build a competent dialect identification system using frozen pre-trained self-supervised speech representations instead of fine-tuned input features. First, we fine-tune and compare several pre-trained self-supervised models to select the best framework as the speech representation extractor. Then, we introduce Transformers with a sparse self-attention mechanism to model both the global and the local information in those representations. Finally, to further boost performance, we explore several settings of the proposed method. We conduct experiments on the KeSpeech dataset. Experimental results show that our method significantly outperforms the best known results, with an average cost performance (Cavg) of 0.0911, an equal error rate (EER) of 8.53%, and an accuracy (Acc) of 80.38%. In addition, the proposed method obtains consistent relative improvements over its fine-tuned counterpart while requiring less training time and computation.
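The paper's implementation is not reproduced on this page, but the pipeline the abstract describes can be sketched as follows: a frozen pre-trained self-supervised model supplies frame-level speech representations, and a lightweight Transformer whose self-attention is restricted to a local band (one simple form of sparse attention) classifies the dialect. This is a minimal sketch, not the authors' code: the extractor choice ("facebook/wav2vec2-base"), the banded mask, the statistics-pooling head, the window size, and the label-set size are all illustrative assumptions.

```python
# Hedged sketch of the abstract's pipeline: frozen SSL extractor +
# sparse-attention Transformer classifier. All hyperparameters are guesses.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # any SSL extractor could be swapped in


def banded_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True blocks attention outside a local window."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() > window


class SparseTransformerDialectID(nn.Module):
    def __init__(self, feat_dim=768, n_heads=8, n_layers=2,
                 window=16, n_dialects=9):  # 9: Mandarin + 8 KeSpeech subdialects (assumed)
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.window = window
        self.classifier = nn.Linear(2 * feat_dim, n_dialects)

    def forward(self, feats):                       # feats: (B, T, feat_dim)
        mask = banded_attention_mask(feats.size(1), self.window).to(feats.device)
        hidden = self.encoder(feats, mask=mask)     # locally restricted attention
        # Statistics pooling: concatenate mean and std over the time axis.
        pooled = torch.cat([hidden.mean(dim=1), hidden.std(dim=1)], dim=-1)
        return self.classifier(pooled)


# Frozen extractor: no gradients flow into the pre-trained model.
# (In practice the waveform should be normalized with the matching
# Wav2Vec2FeatureExtractor; a raw tensor is used here for brevity.)
extractor = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
extractor.eval()
classifier = SparseTransformerDialectID()

wav = torch.randn(1, 16000)                         # 1 s of 16 kHz audio
with torch.no_grad():
    feats = extractor(wav).last_hidden_state        # (1, T, 768)
logits = classifier(feats)                          # (1, n_dialects)
```

Because the extractor runs under torch.no_grad() and its weights are never updated, only the small classifier is trained, which is the source of the time and storage savings the abstract claims over full fine-tuning.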


Cited By

  • (2025) Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification. IEEE Access, vol. 13, pp. 3115–3129. DOI: 10.1109/ACCESS.2024.3523951
  • (2024) A Multi-Task Approach with Multi-Grained Information Extraction for Dialect Speech Recognition. Proceedings of the 2024 4th International Conference on Artificial Intelligence, Automation and Algorithms, pp. 51–56. DOI: 10.1145/3700523.3700534. Online publication date: 27-Sep-2024

Published In

CCEAI '24: Proceedings of the 2024 8th International Conference on Control Engineering and Artificial Intelligence
January 2024, 297 pages
ISBN: 9798400707971
DOI: 10.1145/3640824

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. self-supervised learning
  2. sparse Transformers
  3. speech representations

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Zhejiang Electric Power Co., Ltd.

Conference

CCEAI 2024

Article Metrics

  • Downloads (last 12 months): 59
  • Downloads (last 6 weeks): 15

Reflects downloads up to 02 Mar 2025.
