Abstract
Long short-term memory (LSTM) is a state-of-the-art recurrent network used for a range of tasks in natural language processing (NLP), pattern recognition, and classification. It has also been applied successfully to speech recognition and speaker identification. The amount of training data and the ratio of training to test data remain key factors in achieving good results, and both affect how practical a system is in real use. The main contribution of this paper is a high speaker recognition rate for text-independent continuous speech, achieved with a small ratio of training to test data by applying an LSTM recurrent neural network. The LSTM is compared with a probabilistic feed-forward neural network on speaker recognition as well as on gender and language identification.
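As a rough illustration of the kind of model the abstract refers to, the sketch below runs one LSTM layer over a sequence of per-frame feature vectors and maps the final hidden state to class probabilities (for example, one class per speaker). It is a minimal sketch and not the chapter's architecture: the weights are random and untrained, and the names `init_lstm` and `lstm_classify`, the 13-dimensional frames, the hidden size, and the number of classes are all assumptions chosen for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_lstm(n_in, n_hidden, n_classes, seed=0):
    """Random, untrained parameters (illustrative only)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    # One weight matrix per gate, applied to the concatenation [x_t ; h_{t-1}].
    params = {gate: mat(n_hidden, n_in + n_hidden) for gate in "ifog"}
    params["b"] = {gate: [0.0] * n_hidden for gate in "ifog"}
    params["out"] = mat(n_classes, n_hidden)
    return params


def lstm_classify(frames, params):
    """Run the LSTM over all frames, then softmax the last hidden state."""
    n_hidden = len(params["i"])
    h = [0.0] * n_hidden  # hidden state
    c = [0.0] * n_hidden  # cell state
    for x in frames:
        z = list(x) + h
        pre = {
            gate: [a + b for a, b in zip(matvec(params[gate], z), params["b"][gate])]
            for gate in "ifog"
        }
        i_gate = [sigmoid(v) for v in pre["i"]]   # input gate
        f_gate = [sigmoid(v) for v in pre["f"]]   # forget gate
        o_gate = [sigmoid(v) for v in pre["o"]]   # output gate
        cand = [math.tanh(v) for v in pre["g"]]   # candidate cell values
        c = [f * cv + i * g for f, cv, i, g in zip(f_gate, c, i_gate, cand)]
        h = [o * math.tanh(cv) for o, cv in zip(o_gate, c)]
    logits = matvec(params["out"], h)
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Synthetic "utterance": 20 frames of 13 features each (assumed sizes).
rng = random.Random(1)
utterance = [[rng.gauss(0.0, 1.0) for _ in range(13)] for _ in range(20)]
params = init_lstm(n_in=13, n_hidden=8, n_classes=4)
probs = lstm_classify(utterance, params)
```

Because the final hidden state summarizes the whole sequence, the same function accepts utterances of any length, which is what makes a recurrent model a natural fit for text-independent speech.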
Acknowledgements
This work was supported by grant S/WI/3/2018 from Bialystok University of Technology and funded with resources for research by the Ministry of Science and Higher Education in Poland.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Nammous, M.K., Saeed, K. (2019). Natural Language Processing: Speaker, Language, and Gender Identification with LSTM. In: Chaki, R., Cortesi, A., Saeed, K., Chaki, N. (eds) Advanced Computing and Systems for Security. Advances in Intelligent Systems and Computing, vol 883. Springer, Singapore. https://doi.org/10.1007/978-981-13-3702-4_9
DOI: https://doi.org/10.1007/978-981-13-3702-4_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-3701-7
Online ISBN: 978-981-13-3702-4
eBook Packages: Intelligent Technologies and Robotics (R0)