Using Related Text Sources to Improve Classification of Transcribed Speech Data

Shrestha, Niraj; Moons, Elias; Moens, Marie-Francine

doi:10.1007/978-3-030-14118-9_51

Conference paper
First Online: 17 March 2019

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 921)

Abstract

Today’s content, including user-generated content, is increasingly found in multimedia format. It is known that speech data are sometimes incorrectly transcribed, especially when they are spoken by voices on which the transcribers have not been trained or when they contain unfamiliar words. A familiar mining task that helps in storage, indexing and retrieval is automatic classification with predefined category labels. Although state-of-the-art classifiers like neural networks, support vector machines (SVM) and logistic regression classifiers perform quite satisfactorily when categorizing written text, their performance degrades when applied to speech data transcribed by automatic speech recognition (ASR), due to transcription errors such as inserted and deleted words, grammatical errors and words that are simply transcribed wrongly. In this paper, we show that incorporating content from related written sources in the training of the classification model is beneficial. We focus especially on different representations that make this integration possible and compare them, such as representations of speech data that embed content from the written text and simple concatenation of speech and written content. In addition, we qualitatively demonstrate that these representations, to a certain extent, indirectly correct the transcription noise.

Notes

1. We have tested several other models such as a support vector machine and a Naive Bayes classifier, but the results did not change substantially.
2. If a named entity consists of more than one token, then we convert it to a single token representation by joining its components with underscore (“_”), for example “new york” is converted to “new_york”. Thus “new_york” is a single feature rather than two different features when we use “new” and “york”.
3. https://code.google.com/archive/p/word2vec/.
4. http://abcnews.go.com/.
5. http://www.newsy.com/.
6. We will make the speech transcriptions of the collected dataset available after publication including its split in train and test data.
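
The underscore joining described in note 2 can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name `join_named_entities` is hypothetical, and the sketch assumes the named entities are already available as a plain list of strings, since the notes do not say how entities are detected.

```python
def join_named_entities(tokens, entities):
    """Merge each multi-token named entity in `tokens` into one token.

    `tokens` is a list of word tokens; `entities` is a list of entity
    strings such as "new york". Both are assumptions of this sketch:
    the source does not specify how named entities are recognized.
    """
    # Split each entity into lowercase tokens and try longer entities first,
    # so that "new york city" is preferred over "new york".
    patterns = sorted(
        (entity.lower().split() for entity in entities),
        key=len,
        reverse=True,
    )
    merged = []
    i = 0
    while i < len(tokens):
        for pattern in patterns:
            n = len(pattern)
            window = [token.lower() for token in tokens[i:i + n]]
            if n > 1 and window == pattern:
                # Emit the whole entity as a single (lowercased) feature.
                merged.append("_".join(pattern))
                i += n
                break
        else:
            # No entity starts at this position: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    sentence = "she moved to new york last year".split()
    print(join_named_entities(sentence, ["new york"]))
```

With this representation, a bag-of-words feature extractor sees "new_york" as one feature instead of the two unrelated features "new" and "york".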

Author information

Authors and Affiliations

Department of Computer Science, KU Leuven, Leuven, Belgium
Niraj Shrestha, Elias Moons & Marie-Francine Moens

Corresponding author

Correspondence to Niraj Shrestha.

Editor information

Editors and Affiliations

Faculty of Computers and Information, Cairo University, Giza, Egypt
Aboul Ella Hassanien
Faculty of Computers and Information, Benha University, Benha, Egypt
Ahmad Taher Azar
School of Computing, Science and Engineering, University of Salford, Salford, Greater Manchester, UK
Tarek Gaber
Department of Computer Science and Engineering, School of Computing and IT, Faculty of Engineering, Manipal University Jaipur, Jaipur, Rajasthan, India
Roheet Bhatnagar
Faculty of Computer and Information Science, Ain Shams University, Cairo, Egypt
Mohamed F. Tolba

About this paper

Cite this paper

Shrestha, N., Moons, E., Moens, MF. (2020). Using Related Text Sources to Improve Classification of Transcribed Speech Data. In: Hassanien, A., Azar, A., Gaber, T., Bhatnagar, R., F. Tolba, M. (eds) The International Conference on Advanced Machine Learning Technologies and Applications (AMLTA2019). AMLTA 2019. Advances in Intelligent Systems and Computing, vol 921. Springer, Cham. https://doi.org/10.1007/978-3-030-14118-9_51

DOI: https://doi.org/10.1007/978-3-030-14118-9_51
Published: 17 March 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-14117-2
Online ISBN: 978-3-030-14118-9
eBook Packages: Intelligent Technologies and Robotics (R0)
