Abstract
This study proposes a new approach to improving spoken term detection with acoustic word embeddings. Our model combines convolutional neural networks (CNNs) and long short-term memory (LSTM) networks to capture sequential information and produce fixed-dimensional word-level embeddings. We introduce a novel deep word discrimination loss that sharpens the distinctiveness of these embeddings and thereby improves word differentiation. We also develop a matching scheme that pairs a neural network with a text-to-speech technique to generate acoustic embeddings directly from text; such embeddings are crucial for effective cross-modal retrieval and audio indexing, especially for detecting unseen words. Experimental results show that our method outperforms traditional baselines on word discrimination tasks, achieving higher mean average precision scores, and that the matching scheme significantly enhances spoken term detection for both regular and unseen words, paving the way for future advances in audio indexing, cross-modal retrieval, and search.
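To make the modeling concrete, the sketch below illustrates the kind of CNN + LSTM encoder and word discrimination training the abstract describes. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the layer sizes, the cosine-based triplet formulation standing in for the paper's deep word discrimination loss, and the margin value are all illustrative.

```python
# Minimal sketch (not the authors' released code): a CNN + LSTM encoder that
# maps a variable-length acoustic feature sequence to a fixed-dimensional
# acoustic word embedding, trained with a triplet-style discrimination loss.
# All layer sizes and the margin are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticWordEncoder(nn.Module):
    def __init__(self, n_feats=40, conv_channels=64, hidden=256, embed_dim=128):
        super().__init__()
        # 1-D convolutions over time capture local spectral patterns.
        self.conv = nn.Sequential(
            nn.Conv1d(n_feats, conv_channels, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(conv_channels, conv_channels, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # A bidirectional LSTM models the sequential structure of the word.
        self.lstm = nn.LSTM(conv_channels, hidden, batch_first=True,
                            bidirectional=True)
        self.proj = nn.Linear(2 * hidden, embed_dim)

    def forward(self, x):  # x: (batch, time, n_feats)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, C)
        _, (h_n, _) = self.lstm(h)
        # Concatenate final forward/backward states -> one fixed-size vector.
        emb = self.proj(torch.cat([h_n[0], h_n[1]], dim=-1))
        return F.normalize(emb, dim=-1)  # unit-norm embedding

def word_discrimination_loss(anchor, positive, negative, margin=0.4):
    """Triplet-style stand-in for the paper's deep word discrimination loss:
    pull same-word embeddings together, push different-word ones apart."""
    pos = 1.0 - F.cosine_similarity(anchor, positive)
    neg = 1.0 - F.cosine_similarity(anchor, negative)
    return F.relu(pos - neg + margin).mean()
```

A hypothetical retrieval step follows. Because the embeddings are unit-normalized, a dot product gives cosine similarity, so spoken term detection reduces to nearest-neighbor search over indexed segment embeddings; for a text query, TTS-generated audio would supply the query features. The tensors below are random placeholders for real acoustic features.

```python
# Hypothetical usage: embed a batch of indexed audio segments and rank them
# against a query embedding (from a spoken example, or from TTS audio when
# the query is text). Random tensors stand in for real acoustic features.
encoder = AcousticWordEncoder()
segments = torch.randn(1000, 80, 40)  # 1000 indexed segments, 80 frames, 40 feats
query = torch.randn(1, 80, 40)
scores = encoder(segments) @ encoder(query).T   # cosine similarity (unit norm)
top5 = scores.squeeze(1).topk(5).indices        # best-matching segment indices
```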
About this article
Cite this article
Chantangphol, P., Sakdejayont, T. & Chalothorn, T. Enhancing spoken term detection with deep acoustic word embeddings and cross-modal matching techniques. Int J Speech Technol 27, 875–886 (2024). https://doi.org/10.1007/s10772-024-10145-1