Abstract
With more and more audio being captured and stored, there is a growing need for automatic audio indexing and retrieval techniques that can retrieve relevant audio pieces quickly on demand. This paper provides a comprehensive survey of audio indexing and retrieval techniques. We first describe the main audio characteristics and features and discuss techniques for classifying audio into speech and music based on these features. Indexing and retrieval of speech and music are then described separately. Finally, the significance of audio in multimedia indexing and retrieval is discussed.
Cite this article
Lu, G. Indexing and Retrieval of Audio: A Survey. Multimedia Tools and Applications 15, 269–290 (2001). https://doi.org/10.1023/A:1012491016871