Latent Topic-based Subspace for Natural Language Processing

Morchid, Mohamed; Bousquet, Pierre-Michel; Kheder, Waad Ben; Janod, Killian

doi:10.1007/s11265-018-1388-1

Published: 20 July 2018

Volume 91, pages 833–853, (2019)

Journal of Signal Processing Systems

Mohamed Morchid ORCID: orcid.org/0000-0002-4427-2468¹,
Pierre-Michel Bousquet¹,
Waad Ben Kheder¹ &
…
Killian Janod¹

Abstract

Natural Language Processing (NLP) applications have difficulty dealing with automatically transcribed spoken documents recorded in noisy conditions, because of high Word Error Rates (WER), and with textual documents from the Internet, such as forums or micro-blogs, because of misspelled or truncated words, poor grammar, and similar errors. To improve robustness against document errors, previously proposed methods map these noisy documents into a latent space using models such as Latent Dirichlet Allocation (LDA), supervised LDA and author-topic (AT) models. In comparison to LDA, the AT model considers not only the document content (words) but also the class related to the document. In addition to these high-level representation models, an original compact representation, called c-vector, has recently been introduced to avoid the tricky choice of the number of latent topics in these topic-based representations. The main drawback of the c-vector space building process is the number of sub-tasks it requires. Recently, we proposed both improving the performance of this compact c-vector representation of spoken documents and reducing the number of sub-tasks needed, using an original framework that builds a robust low-dimensional feature space from a set of AT models, called “Latent Topic-based Subspace” (LTS). This paper goes further by comparing the original LTS-based representation with the c-vector technique, with the state-of-the-art neural compression approach based on Encoder-Decoder networks (Autoencoder), and with the classification methods deep neural networks (DNN) and long short-term memory (LSTM), on two classification tasks, using noisy documents in the form of speech conversations as well as textual documents from the 20-Newsgroups corpus.
Results show that the original LTS representation outperforms the best previous compact representations, with substantial gains of more than 2.1 and 3.3 points in correctly labeled documents over the c-vector and Autoencoder neural networks respectively. An algorithm for optimizing the scoring model parameters is then proposed to improve both the robustness and the performance of the proposed LTS-based approach. Finally, an automatic clustering approach based on the radial proximity between document classes is introduced and shows promising performance.
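
As a rough illustration of the Autoencoder baseline mentioned above, the sketch below passes a document vector through a small encoder-decoder network and reads the features from the middle hidden layer, as described in the Notes. It is a minimal sketch under stated assumptions, not the authors' implementation: the layer sizes, the tanh activation, the random untrained weights, the input vector and the helper names (`make_layer`, `dense_tanh`, `forward`, `bottleneck_features`) are all hypothetical.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (untrained, illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense_tanh(x, layer):
    """Apply one dense layer followed by a tanh activation."""
    weights, biases = layer
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def forward(x, layers):
    """Return the activations of every layer, with the input as the first entry."""
    activations = [x]
    for layer in layers:
        activations.append(dense_tanh(activations[-1], layer))
    return activations


# Hypothetical sizes: input 6, hidden 4, middle 5, hidden 4, output 6.
# The middle layer is deliberately larger than its neighbours.
SIZES = [6, 4, 5, 4, 6]
rng = random.Random(0)
LAYERS = [make_layer(a, b, rng) for a, b in zip(SIZES, SIZES[1:])]


def bottleneck_features(x):
    """Return the middle hidden layer activations for input vector x."""
    activations = forward(x, LAYERS)
    return activations[len(SIZES) // 2]


# Hypothetical topic-based feature vector for one document.
doc = [0.30, 0.05, 0.10, 0.25, 0.20, 0.10]
features = bottleneck_features(doc)
reconstruction = forward(doc, LAYERS)[-1]
```

In a real system the weights would be trained to reconstruct the input, and `features` would then be passed to a classifier; here the network is untrained, so only the shapes and the extraction step are meaningful.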

Notes

The Universal Background Model (UBM) is a GMM (Gaussian Mixture Model) that represents all the possible observations.
The name “bottleneck” is used to make clear that features are extracted from the middle hidden layer, even if this layer is as large as or larger than the other layers.
http://code.google.com/p/stop-words/
http://qwone.com/~jason/20Newsgroups/

Author information

Authors and Affiliations

Laboratoire Informatique d’Avignon (LIA), University of Avignon, Avignon, France
Mohamed Morchid, Pierre-Michel Bousquet, Waad Ben Kheder & Killian Janod

Corresponding author

Correspondence to Mohamed Morchid.

About this article

Cite this article

Morchid, M., Bousquet, PM., Kheder, W.B. et al. Latent Topic-based Subspace for Natural Language Processing. J Sign Process Syst 91, 833–853 (2019). https://doi.org/10.1007/s11265-018-1388-1

Received: 08 September 2017
Revised: 11 April 2018
Accepted: 01 June 2018
Published: 20 July 2018
Issue Date: 15 August 2019
DOI: https://doi.org/10.1007/s11265-018-1388-1
