Abstract
This article describes a methodology for collecting text from the Web to match a target sublanguage in both style (register) and topic. Unlike other work that estimates n-gram statistics from page counts, the approach here selects and filters documents, which provides more control over the type of material contributing to the n-gram counts. The data can be used in a variety of ways; here, the different sources are combined in two types of mixture models. Focusing on conversational speech, where data collection can be quite costly, experiments demonstrate the positive impact of Web collections on several tasks with varying amounts of data, including Mandarin and English telephone conversations and English meetings and lectures.
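The abstract does not spell out the two mixture forms. As a rough illustration of the general idea of combining sources, the sketch below assumes the simplest common variant: a linear interpolation of component language models whose weights are estimated with EM on held-out in-domain text. The function name `em_mixture_weights` and the input layout are hypothetical and are not taken from the article.

```python
from typing import List


def em_mixture_weights(probs: List[List[float]], iters: int = 50) -> List[float]:
    """Estimate linear-interpolation weights for K component models with EM.

    probs[t][k] is the probability that component model k (for example, an
    in-domain model or a Web-text model) assigns to held-out token t given its
    history. All probabilities are assumed to be strictly positive, as they
    would be for smoothed n-gram models.
    """
    num_components = len(probs[0])
    # Start from uniform weights.
    weights = [1.0 / num_components] * num_components
    for _ in range(iters):
        totals = [0.0] * num_components
        for row in probs:
            # Mixture probability of this token under the current weights.
            mix = sum(w * p for w, p in zip(weights, row))
            # E-step: posterior responsibility of each component for the token.
            for k in range(num_components):
                totals[k] += weights[k] * row[k] / mix
        # M-step: new weight is the average responsibility over all tokens.
        weights = [t / len(probs) for t in totals]
    return weights


def mixture_prob(weights: List[float], component_probs: List[float]) -> float:
    """Interpolated probability of one token: sum_k weights[k] * p_k."""
    return sum(w * p for w, p in zip(weights, component_probs))
```

In this sketch, a component that consistently assigns higher probability to the held-out text receives a larger weight, which is how an interpolated model can down-weight Web text that matches the target register poorly.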
Index Terms
- Web resources for language modeling in conversational speech recognition