Abstract
Speech is an information-rich component of multimedia. Information can be extracted from a speech signal in a number of different ways, giving rise to several well-established speech signal analysis research fields, including speech recognition, speaker recognition, event detection, and fingerprinting. The tools and methods developed in these fields can greatly enhance multimedia systems. In this paper, we present the current state of research in each of the major speech analysis fields. The goal is to give someone new to the field enough background to quickly gain a high-level understanding and to provide direction for further study.
Acknowledgments
This work has been supported by a government client. The Pacific Northwest National Laboratory is managed for the US Department of Energy by Battelle Memorial Institute under Contract DE-AC05-76RL01830.
Communicated by M. Kankanhalli.
Hafen, R.P., Henry, M.J. Speech information retrieval: a review. Multimedia Systems 18, 499–518 (2012). https://doi.org/10.1007/s00530-012-0266-0