
An algorithm for similar utterance section extraction for managing spoken documents

  • Regular Paper
  • Published in Multimedia Systems (2005)

Abstract

This paper proposes a new, efficient algorithm for extracting similar sections between two time-sequence data sets. The algorithm, called Relay Continuous Dynamic Programming (Relay CDP), realizes fast matching between arbitrary sections in the reference pattern and the input pattern and enables the extraction of similar sections in a frame-synchronous manner. In addition, Relay CDP is extended to two types of applications that handle spoken documents. The first application is the extraction of repeated utterances in a presentation or a news speech, because repeated utterances can be assumed to be important parts of the speech; these repeated utterances can be regarded as labels for information retrieval. The second application is flexible spoken document retrieval. A phonetic model is introduced to cope with the speech of different speakers. The new algorithm allows a user to query by a natural utterance and searches spoken documents for any partial matches to the query utterance. We present a detailed explanation of Relay CDP, experimental results for the extraction of similar sections, and results for the two applications that use Relay CDP.
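
The abstract describes Relay CDP only at a high level. As a rough illustration of the continuous-DP spotting idea it builds on (a free starting point along the input and frame-synchronous scoring), the following Python sketch matches a whole reference pattern against every possible end frame of an input pattern. It is a minimal sketch under assumed choices (Euclidean local distance, a symmetric step pattern, normalization by path length) and is not the paper's Relay CDP, which additionally handles matches between arbitrary sections of both patterns.

import numpy as np

def cdp_spotting(ref, inp):
    """Simplified continuous-DP-style spotting sketch (not the paper's Relay CDP).

    ref, inp: arrays of frame-level feature vectors, shape (frames, dims).
    Returns, for every input frame j, a length-normalized accumulated distance
    for a warped match of the whole reference pattern ending at frame j.
    A low score at frame j suggests that a section similar to `ref` ends there.
    """
    ref = np.asarray(ref, dtype=float)
    inp = np.asarray(inp, dtype=float)
    I, J = len(ref), len(inp)

    # Local frame-to-frame distances (Euclidean here; an illustrative choice).
    local = np.linalg.norm(ref[:, None, :] - inp[None, :, :], axis=2)

    acc = np.full((I, J), np.inf)        # accumulated distance
    steps = np.zeros((I, J), dtype=int)  # path length, used for normalization

    # Free start: the reference pattern may begin at any input frame.
    acc[0, :] = local[0, :]
    steps[0, :] = 1

    for i in range(1, I):
        for j in range(1, J):
            # Symmetric step pattern: diagonal, vertical, and horizontal moves.
            cands = (acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            lens = (steps[i - 1, j - 1], steps[i - 1, j], steps[i, j - 1])
            k = int(np.argmin(cands))
            acc[i, j] = cands[k] + local[i, j]
            steps[i, j] = lens[k] + 1

    # Frame-synchronous output: normalized score at the last reference frame.
    return acc[-1, :] / np.maximum(steps[-1, :], 1)

Input frames at which the returned score dips below a threshold would then be candidates for similar sections; the threshold and the feature representation are application-dependent assumptions, not values taken from the paper.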


Abbreviations

CDP: Continuous dynamic programming

URP: Unit reference pattern


Author information


Corresponding author

Correspondence to Yoshiaki Itoh.

Additional information

Yoshiaki Itoh has been an associate professor in the Faculty of Software and Information Science at Iwate Prefectural University, Iwate, Japan, since 2001. He received the B.E., M.E., and Dr. Eng. degrees from Tokyo University, Tokyo, in 1987, 1989, and 1999, respectively. From 1989 to 2001 he was a researcher and a staff member of Kawasaki Steel Corporation, Tokyo and Okayama. From 1992 to 1994 he was transferred as a researcher to the Real World Computing Partnership, Tsukuba, Japan. Dr. Itoh's research interests include spoken document processing without recognition, audio and video retrieval, and real-time human communication systems. He is a member of ISCA, the Acoustical Society of Japan, the Institute of Electronics, Information and Communication Engineers, the Information Processing Society of Japan, and the Japan Society of Artificial Intelligence.

Kazuyo Tanaka has been a professor at the University of Tsukuba, Tsukuba, Japan, since 2002. He received the B.E. degree from Yokohama National University, Yokohama, Japan, in 1970, and the Dr. Eng. degree from Tohoku University, Sendai, Japan, in 1984. From 1971 to 2002 he was a research officer at the Electrotechnical Laboratory (ETL), Tsukuba, Japan, and the National Institute of Advanced Science and Technology (AIST), Tsukuba, Japan, where he worked on speech analysis, synthesis, recognition, and understanding, and also served as chief of the speech processing section. His current interests include digital signal processing, spoken document processing, and human information processing. He is a member of IEEE, ISCA, the Acoustical Society of Japan, the Institute of Electronics, Information and Communication Engineers, and the Japan Society of Artificial Intelligence.

Shi-Wook Lee received the B.E. and M.E. degrees from Yeungnam University, Korea, and the Ph.D. degree from the University of Tokyo in 1995, 1997, and 2001, respectively. Since 2001 he has been working in the Research Group of Speech and Auditory Signal Processing at the National Institute of Advanced Science and Technology (AIST), Tsukuba, Japan, as a postdoctoral fellow. His research interests include spoken document processing, speech recognition, and understanding.


About this article

Cite this article

Itoh, Y., Tanaka, K. & Lee, SW. An algorithm for similar utterance section extraction for managing spoken documents. Multimedia Systems 10, 432–443 (2005). https://doi.org/10.1007/s00530-005-0172-9
