Abstract
Much of the information in multimedia data related to terrorist activity can be extracted from the audio content. Our work in ongoing projects aims to provide a complete description of the audio portion of multimedia documents. This information can be derived from diarization, classification of acoustic events, language and speaker segmentation and clustering, as well as automatic transcription of the speech portions. An important consideration is ensuring that the audio processing technologies are well suited to the types of data of interest to law enforcement agencies. While language identification and speech recognition may be considered 'mature technologies', our experience is that even state-of-the-art systems require customisation and enhancements to address the challenges of terrorist-related audio documents.
This work was partially financed by the Horizon 2020 project DANTE (Detecting and Analysing Terrorist-Related Online Contents and Financing Activities) and by the French National Agency for Research (ANR) as part of the SALSA project (Speech and Language Technologies for Security Applications) under grant ANR-14-CE28-0021.
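The paper itself presents no code; as a rough illustration of the first stage of the pipeline sketched in the abstract (partitioning audio into speech and non-speech regions before downstream language identification and transcription), the following is a minimal energy-based speech activity detection sketch. All function names, frame sizes, and the fixed threshold are illustrative assumptions, not the authors' system, which the abstract indicates relies on far more robust techniques.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a mono waveform into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack(
        [signal[i * hop_len : i * hop_len + frame_len] for i in range(n_frames)]
    )

def speech_activity(signal, sample_rate, threshold_db=-20.0):
    """Label each frame as speech (True) or non-speech (False) by
    comparing its log energy to a threshold relative to the peak frame.
    The -20 dB margin is an illustrative choice, not a tuned value."""
    frames = frame_signal(signal, sample_rate)
    energy_db = 10.0 * np.log10(np.mean(frames**2, axis=1) + 1e-10)
    return energy_db > (energy_db.max() + threshold_db)

if __name__ == "__main__":
    sr = 16000
    rng = np.random.default_rng(0)
    # One second of low-level noise followed by one second of a louder
    # tone, standing in for non-speech and speech regions.
    noise = 0.01 * rng.standard_normal(sr)
    tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
    labels = speech_activity(np.concatenate([noise, tone]), sr)
    print(f"{labels.mean():.0%} of frames labelled speech")
```

In a deployed system this thresholding stage would typically be replaced by a trained neural classifier, and the resulting speech segments would feed the speaker clustering, language identification, and transcription components the abstract enumerates.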