Content-based search in multilingual audiovisual documents using the International Phonetic Alphabet

Multimedia Tools and Applications

Abstract

We present an approach that uses the International Phonetic Alphabet (IPA) for content-based indexing and retrieval of multilingual audiovisual documents. The approach works even when the languages of the document are unknown. It has been validated in the context of the “Star Challenge” search engine competition organized by the Agency for Science, Technology and Research (A*STAR) of Singapore. The approach combines an IPA-based multilingual acoustic model with a dynamic programming method that searches document segments by “IPA string spotting”. Dynamic programming makes it possible to retrieve the query string in the document string even when the transcription error rate at the phone level is significant. With these methods, we ranked first and third on the monolingual (English) search task, fifth on the multilingual search task, and first on the multimodal (audio and image) search task.
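To make the idea of error-tolerant “IPA string spotting” concrete, the following is a minimal sketch of approximate substring matching by dynamic programming over phone sequences. It is not the authors' implementation: the function name `spot_phone_string`, the unit costs for substitution, insertion, and deletion, and the toy phone sequences are all assumptions chosen for illustration. The only feature taken from the abstract is that a query string is located inside a longer document string despite phone-level errors.

```python
from typing import Sequence, Tuple


def spot_phone_string(query: Sequence[str], document: Sequence[str]) -> Tuple[int, int]:
    """Find the best approximate occurrence of `query` inside `document`.

    Returns (cost, end), where `cost` is the smallest edit distance between
    the query and any contiguous document segment, and `end` is the
    (exclusive) document index at which the first such segment ends.
    Unit edit costs are an illustrative assumption.
    """
    m = len(query)
    # prev[i]: cost of aligning query[:i] with a segment ending at the
    # previous document position. Before any phone is read, the only
    # option is to drop all i query phones.
    prev = list(range(m + 1))
    best_cost, best_end = prev[m], 0

    for j, doc_phone in enumerate(document, start=1):
        # curr[0] = 0 lets a match start at any document position for free.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            substitute = prev[i - 1] + (0 if query[i - 1] == doc_phone else 1)
            extra_doc_phone = prev[i] + 1        # document has a spurious phone
            missing_doc_phone = curr[i - 1] + 1  # document lacks a query phone
            curr[i] = min(substitute, extra_doc_phone, missing_doc_phone)
        if curr[m] < best_cost:
            best_cost, best_end = curr[m], j
        prev = curr

    return best_cost, best_end


if __name__ == "__main__":
    query = ["h", "e", "l", "o"]
    clean = ["d", "a", "h", "e", "l", "o", "w"]
    noisy = ["d", "a", "h", "e", "l", "u", "w"]  # one phone mis-transcribed
    print(spot_phone_string(query, clean))
    print(spot_phone_string(query, noisy))
```

In a retrieval setting, a segment could be accepted when the returned cost, normalized by the query length, falls below a threshold. That threshold, like the unit costs, would need tuning and is not specified here.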

Notes

  1. http://hlt.i2r.a-star.edu.sg/starchallenge

  2. http://www.speech.cs.cmu.edu/sphinx/

  3. http://cmusphinx.sourceforge.net/sphinx3/doc/s3_overview.html

  4. http://www.speech.cs.cmu.edu/sphinx/models/

  5. http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Acknowledgement

Part of this work has been supported by the Quaero programme.

Author information

Correspondence to Georges Quénot.

About this article

Cite this article

Quénot, G., Tan, T.P., Le, V.B. et al. Content-based search in multilingual audiovisual documents using the International Phonetic Alphabet. Multimed Tools Appl 48, 123–140 (2010). https://doi.org/10.1007/s11042-009-0377-6

  • DOI: https://doi.org/10.1007/s11042-009-0377-6
