Abstract
In this paper, we propose a new phone selection method for language identification (LID) that selects the higher-scoring phones, i.e., those most similar to the target language. The phone selection is data-driven, which avoids reliance on complex semantic knowledge and significantly reduces the manual cost of learning different languages. Recently, bidirectional long short-term memory (BLSTM) models have improved automatic speech recognition (ASR) performance by providing more accurate frame-level content alignments that draw on sequence information over longer durations. In principle, the output of a BLSTM-based ASR system contains more candidates in the form of a phone lattice, which can reduce the adverse effect of many practical factors, such as variations in channel, noise, and accent. First, initial phone sequences are extracted from the phone lattice generated from the recognition results of the BLSTM-based ASR system. Second, an asymmetric distance between each phone and the target language is proposed and applied to weight the initial phone sequences. Third, language-related phones are selected from the weighted phones. Finally, the selected phones are used to re-score input sentences for the LID system. Extensive experiments on the AP16-OLR Challenge validate the effectiveness of the proposed method. The results show that the selected phones are more effective for LID than the remaining phones. Our method yields an improvement of up to 39.96% in terms of Cavg compared with the method without phone selection.
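The four steps above (extract, weight, select, re-score) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the asymmetric distance, so the one-sided, KL-style term used here is an assumed stand-in, and the function names, the `eps` smoothing constant, the `top_k` cutoff, and the toy 1-best phone sequences (in place of real phone lattices) are all hypothetical.

```python
from collections import Counter
import math


def phone_distribution(utterances):
    """Relative frequency of each phone over a list of phone sequences."""
    counts = Counter(p for seq in utterances for p in seq)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}


def asymmetric_weights(target_utts, other_utts, eps=1e-6):
    """Assumed stand-in for the paper's asymmetric distance.

    Each phone seen in the target language gets the one-sided term
    p_t * log(p_t / p_o), which is not symmetric in the two languages.
    """
    p_t = phone_distribution(target_utts)
    p_o = phone_distribution(other_utts)
    return {
        p: p_t[p] * math.log(p_t[p] / (p_o.get(p, 0.0) + eps))
        for p in p_t
    }


def select_phones(weights, top_k):
    """Keep the top_k phones with the highest weight."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return set(ranked[:top_k])


def rescore(seq, weights, selected):
    """Average weight of the selected phones found in one utterance."""
    kept = [weights[p] for p in seq if p in selected]
    return sum(kept) / len(kept) if kept else 0.0


# Toy data: 1-best phone sequences standing in for lattice output.
target_utts = [["a", "b", "a"], ["a", "c"]]
other_utts = [["b", "c", "c"], ["b", "d"]]

weights = asymmetric_weights(target_utts, other_utts)
selected = select_phones(weights, top_k=1)
score = rescore(["a", "b", "a"], weights, selected)
```

In this toy setup, the phone that is frequent in the target data and absent from the other data receives the largest weight, so utterances containing it score higher. A real system would accumulate statistics over lattice arcs and their posteriors rather than over 1-best sequences.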
Acknowledgements
This work is partially supported by the Key Technologies Research & Development Program of Shenzhen (No: JSGG20150512160434776) and the Key Technologies Research & Development of Data Retrieval and Monitoring via Multi-layer Network (No: JSGG20160229121006579).
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Song, X., Cheng, Q., Xing, J., Zou, Y. (2018). Data-Driven Phone Selection for Language Identification via Bidirectional Long Short-Term Memory Modeling. In: Li, K., Li, W., Chen, Z., Liu, Y. (eds) Computational Intelligence and Intelligent Systems. ISICA 2017. Communications in Computer and Information Science, vol 873. Springer, Singapore. https://doi.org/10.1007/978-981-13-1648-7_26
DOI: https://doi.org/10.1007/978-981-13-1648-7_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1647-0
Online ISBN: 978-981-13-1648-7
eBook Packages: Computer Science, Computer Science (R0)