Abstract
Speech is an information-rich component of multimedia. Information can be extracted from a speech signal in a number of different ways, giving rise to several well-established speech signal analysis research fields, including speech recognition, speaker recognition, event detection, and fingerprinting. The tools and methods developed in these fields can greatly enhance multimedia systems. In this paper, we present the current state of research in each of the major speech analysis fields. The goal is to give someone new to the field enough background to quickly gain a high-level understanding and to provide direction for further study.
Acknowledgments
This work has been supported by a government client. The Pacific Northwest National Laboratory is managed for the US Department of Energy by Battelle Memorial Institute under Contract DE-AC05-76RL01830.
Communicated by M. Kankanhalli.
Hafen, R.P., Henry, M.J. Speech information retrieval: a review. Multimedia Systems 18, 499–518 (2012). https://doi.org/10.1007/s00530-012-0266-0