Abstract
This paper presents a system for large-scale language model adaptation on a daily-generated, large text corpus using MapReduce in a cloud environment. Our large-scale trigram language model, comprising 800 million trigram counts, was implemented with a new approach built on a representative cloud service (Amazon EC2) and a representative distributed processing framework (Hadoop). The ultimate goal of our research is to find the optimal number of Amazon EC2 instances for LM adaptation under the constraint that the daily-generated Twitter texts must be processed within one day. Trigram count extraction and model update for language model adaptation were performed on 200 million daily-generated Twitter texts. For trigram count extraction, fewer than 3 h were required to process the daily-generated Twitter texts with six instances. For the model update, fewer than 20 h were required with 10 instances. Therefore, language model adaptation for 200 million daily-generated Twitter texts can be completed within 24 h using at least 10 Amazon EC2 instances.
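The two stages summarized above (MapReduce-style trigram count extraction followed by a model update that merges the daily counts into the existing model) can be sketched in plain Python. This is a minimal illustration, not the paper's Hadoop implementation: the function names and the interpolation-weight merge are assumptions chosen for clarity.

```python
from collections import Counter
from itertools import chain

def map_trigrams(tweet):
    """Map step: emit (trigram, 1) pairs for one tweet."""
    tokens = tweet.split()
    return [((tokens[i], tokens[i + 1], tokens[i + 2]), 1)
            for i in range(len(tokens) - 2)]

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts grouped by trigram key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

def update_model(background, daily, weight=0.5):
    """Model update: merge daily trigram counts into the background
    model. The linear-interpolation weight is an illustrative
    assumption standing in for the paper's update scheme."""
    merged = Counter()
    for trigram in set(chain(background, daily)):
        merged[trigram] = ((1 - weight) * background.get(trigram, 0)
                           + weight * daily.get(trigram, 0))
    return merged

# Count extraction over a toy "daily" corpus of two tweets.
tweets = ["the cat sat on the mat", "the cat sat down"]
daily = reduce_counts(chain.from_iterable(map_trigrams(t) for t in tweets))
# ("the", "cat", "sat") occurs in both tweets, so its count is 2.
```

In the actual system, the map and reduce functions run as Hadoop jobs over HDFS partitions of the tweet corpus; the per-instance timings in the abstract come from distributing exactly this kind of counting and merging work across EC2 instances.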
References
200 Million tweets per day (2011) http://blog.twitter.com/2011/06/200-million-tweets-per-day.html
Amazon Elastic Compute Cloud (2010) http://aws.amazon.com/ec2
Bacchiani M, Riley M, Roark B, Sproat R (2006) MAP adaptation of stochastic grammars. J Comput Speech Lang 20(1):41–68
Bakis R, Chen S, Gopalakrishnan P, Gopinath R, Stephane M, Polymenakos L, Franz M (1997) Transcription of broadcast news shows with the IBM large vocabulary speech recognition system. Proceedings of the speech recognition workshop, pp. 67–72
Bellegarda J (2001) An overview of statistical language model adaptation. Proceedings of ITRW on adaptation methods for speech recognition, pp. 165–174
Bellegarda J (2004) Statistical language model adaptation review and perspectives. J Speech Commun 42(1):93–108
Brugnara F, Cettolo M (1995) Improvements in tree-based language model representation. Proceedings of European conference on speech communication and technology, pp. 1797–1800
Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. Proceedings of 7th USENIX symposium on operating systems design and implementation, pp. 335–350
Chang F, Dean J, Ghemawat S, Hsieh W, Wallach D, Burrows M, Chandra T, Fikes A, Gruber R (2006) Bigtable: a distributed storage system for structured data. Proceedings of 7th USENIX symposium on operating systems design and implementation, pp. 205–218
Clarkson P, Rosenfeld R (1997) Statistical language modeling using the CMU-Cambridge toolkit. Proceedings of European conference on speech communication and technology, pp. 2707–2710
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. Proceedings of operating systems design and implementation, pp. 137–150
Federico M (1999) Efficient language model adaptation through MDI estimation. Proceedings of European conference on speech communication technology, pp. 1583–1586
Federico M (2002) Language model adaptation through topic decomposition and MDI estimation. Proceedings of international conference on acoustics, speech and signal processing, pp. 773–776
Gauvain J, Lee C (1994) Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans Speech Audio Process 2(2):291–298
George L (2011) HBase: The definitive guide, O’Reilly Media, pp. 68–71
Langzhou C, Gauvain J, Lamel L, Adda G (2003) Unsupervised language model adaptation for broadcast news. Proceedings of international conference on acoustics, speech and signal processing, pp. 220–223
Masataki H, Sagisaka Y, Kawahara T (1997) Task adaptation using MAP estimation in N-gram language model. Proceedings of international conference on acoustics, speech, and signal processing, pp. 783–786
Pietra S, Pietra V, Mercer R, Roukos S (1992) Adaptive language modeling using minimum discriminant estimation. Proceedings of international conference on acoustics, speech, and signal processing, pp. 633–636
Rosenfeld R (1996) A maximum entropy approach to adaptive statistical language modeling. J Comput Speech Lang 10(3):187–228
Saraclar M, Bacchiani M (2004) Language model adaptation with MAP estimation and the perceptron algorithm. Proceedings of human language technology conference and meeting of the North American chapter of the association for computational linguistics, pp. 21–24
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system, proceedings of mass storage systems and technologies, pp. 1–10
Web 1T 5-gram Version 1 (2006) http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
White T (2011) Hadoop: The definitive guide, O'Reilly Media, pp. 27–41
Wolberg J (2005) Data analysis using the method of least squares: Extracting the most information from experiments. Springer, pp. 31–64
Acknowledgments
This work was supported by the Industrial Strategic Technology Development Program 10035252, Development of Dialog-based Spontaneous Speech Interface Technology on Mobile Platform funded by the Ministry of Trade, Industry & Energy, Korea.
Additional information
This paper is an extended version of a paper presented at the 3rd International Conference on Intelligent Robotics, Automations, Telecommunication Facilities, and Applications (IRoA 2013).
Cite this article
Kim, KH., Jung, DY., Lee, D. et al. Implementation of a large-scale language model adaptation in a cloud environment. Multimed Tools Appl 75, 5029–5045 (2016). https://doi.org/10.1007/s11042-013-1787-z