Abstract
We present an approach that uses the International Phonetic Alphabet (IPA) for content-based indexing and retrieval of multilingual audiovisual documents. The approach works even when the languages spoken in a document are unknown. It was validated in the “Star Challenge” search engine competition organized by the Agency for Science, Technology and Research (A*STAR) of Singapore. The approach comprises an IPA-based multilingual acoustic model and a dynamic-programming method that searches document segments by “IPA string spotting”. Dynamic programming makes it possible to find the query string in the document string even when the phone-level transcription has a significant error rate. With these methods, we ranked first and third on the monolingual (English) search task, fifth on the multilingual search task, and first on the multimodal (audio and image) search task.
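To make the idea of “IPA string spotting” concrete, the following is a minimal sketch of approximate substring matching by dynamic programming. It is not the authors' implementation: the function name `spot`, the unit costs for substitution, insertion and deletion, and the toy phone sequences are all assumptions chosen for illustration. A real system would more plausibly weight errors by phone confusability.

```python
def spot(query, document):
    """Find the lowest-cost occurrence of `query` inside `document`.

    Both arguments are sequences of phone symbols. Returns a pair
    (best_cost, end_index), where end_index is the number of document
    phones consumed up to the end of the best match. All costs are 1
    (an assumption made for this sketch).
    """
    m = len(query)
    # prev[i]: cost of aligning query[:i] with a document substring
    # ending at the previous document position.
    prev = list(range(m + 1))
    best_cost, best_end = prev[m], 0
    for j, phone in enumerate(document, start=1):
        # curr[0] = 0 lets a match start anywhere in the document.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            substitution = prev[i - 1] + (query[i - 1] != phone)
            extra_document_phone = prev[i] + 1
            missing_query_phone = curr[i - 1] + 1
            curr[i] = min(substitution, extra_document_phone, missing_query_phone)
        # Strict comparison keeps the earliest of equally good matches.
        if curr[m] < best_cost:
            best_cost, best_end = curr[m], j
        prev = curr
    return best_cost, best_end


# Hypothetical example: the vowel of the query is mistranscribed in the document.
query = ["k", "æ", "t"]
document = ["ð", "ə", "k", "a", "t", "s", "æ", "t"]
print(spot(query, document))
```

Because the first row of the table is held at zero, the query can begin at any document position, which is what separates spotting from a whole-string edit distance. A retrieval system built on this idea could accept a segment when the returned cost falls below a threshold relative to the query length.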


Acknowledgement
Part of this work has been supported by the Quaero programme.
Cite this article
Quénot, G., Tan, T.P., Le, V.B. et al. Content-based search in multilingual audiovisual documents using the International Phonetic Alphabet. Multimed Tools Appl 48, 123–140 (2010). https://doi.org/10.1007/s11042-009-0377-6
DOI: https://doi.org/10.1007/s11042-009-0377-6