Abstract
This paper presents a system for large-scale language model adaptation on a daily-generated, large text corpus using MapReduce in a cloud environment. Our large-scale trigram language model, comprising 800 million trigram counts, was implemented with a new approach built on a representative cloud service (Amazon EC2) and a representative distributed processing framework (Hadoop). The ultimate goal of our research is to find the optimal number of Amazon EC2 instances for LM adaptation under the constraint that the daily-generated Twitter texts must be processed within one day. Trigram count extraction and model update for language model adaptation were performed on 200 million daily-generated Twitter texts. For trigram count extraction, fewer than 3 h were required to process the daily-generated Twitter texts with six instances. For the model update, fewer than 20 h were required with 10 instances. Therefore, language model adaptation for 200 million daily-generated Twitter texts can be completed within 24 h using at least 10 Amazon EC2 instances.
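The two stages summarized above (MapReduce-style trigram count extraction followed by a model update that merges the daily counts into the existing model) can be sketched in plain Python. This is a minimal illustration, not the paper's Hadoop implementation: the function names and the interpolation-weight merge are assumptions chosen for clarity.

```python
from collections import Counter
from itertools import chain

def map_trigrams(tweet):
    """Map step: emit (trigram, 1) pairs for one tweet."""
    tokens = tweet.split()
    return [((tokens[i], tokens[i + 1], tokens[i + 2]), 1)
            for i in range(len(tokens) - 2)]

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts grouped by trigram key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

def update_model(background, daily, weight=0.5):
    """Model update: merge daily trigram counts into the background
    model. The linear-interpolation weight is an illustrative
    assumption standing in for the paper's update scheme."""
    merged = Counter()
    for trigram in set(chain(background, daily)):
        merged[trigram] = ((1 - weight) * background.get(trigram, 0)
                           + weight * daily.get(trigram, 0))
    return merged

# Count extraction over a toy "daily" corpus of two tweets.
tweets = ["the cat sat on the mat", "the cat sat down"]
daily = reduce_counts(chain.from_iterable(map_trigrams(t) for t in tweets))
# ("the", "cat", "sat") occurs in both tweets, so its count is 2.
```

In the actual system, the map and reduce functions run as Hadoop jobs over HDFS partitions of the tweet corpus; the per-instance timings in the abstract come from distributing exactly this kind of counting and merging work across EC2 instances.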
References
200 Million tweets per day (2011) http://blog.twitter.com/2011/06/200-million-tweets-per-day.html
Amazon Elastic Compute Cloud (2010) http://aws.amazon.com/ec2
Bacchiani M, Riley M, Roark B, Sproat R (2006) MAP adaptation of stochastic grammars. J Comput Speech Lang 20(1):41–68
Bakis R, Chen S, Gopalakrishnan P, Gopinath R, Stephane M, Polymenakos L, Franz M (1997) Transcription of broadcast news shows with the IBM large vocabulary speech recognition system. Proceedings of the speech recognition workshop, pp. 67–72
Bellegarda J (2001) An overview of statistical language model adaptation. Proceedings of ITRW on adaptation methods for speech recognition, pp. 165–174
Bellegarda J (2004) Statistical language model adaptation review and perspectives. J Speech Commun 42(1):93–108
Brugnara F, Cettolo M (1995) Improvements in tree-based language model representation. Proceedings of European conference on speech communication and technology, pp. 1797–1800
Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. Proceedings of 7th USENIX symposium on operating systems design and implementation, pp. 335–350
Chang F, Dean J, Ghemawat S, Hsieh W, Wallach D, Burrows M, Chandra T, Fikes A, Gruber R (2006) Bigtable: a distributed storage system for structured data. Proceedings of 7th USENIX symposium on operating systems design and implementation, pp. 205–218
Clarkson P, Rosenfeld R (1997) Statistical language modeling using the CMU-Cambridge toolkit. Proceedings of European conference on speech communication and technology, pp. 2707–2710
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. Proceedings of operating systems design and implementation, pp. 137–150
Federico M (1999) Efficient language model adaptation through MDI estimation. Proceedings of European conference on speech communication technology, pp. 1583–1586
Federico M (2002) Language model adaptation through topic decomposition and MDI estimation. Proceedings of international conference on acoustics, speech and signal processing, pp. 773–776
Gauvain J, Lee C (1994) Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans Speech Audio Process 2(2):291–298
George L (2011) HBase: The definitive guide, O’Reilly Media, pp. 68–71
Langzhou C, Gauvain J, Lamel L, Adda G (2003) Unsupervised language model adaptation for broadcast news. Proceedings of international conference on acoustics, speech and signal processing, pp. 220–223
Masataki H, Sagisaka Y, Kawahara T (1997) Task adaptation using MAP estimation in N-gram language model. Proceedings of international conference on acoustics, speech, and signal processing, pp. 783–786
Pietra S, Pietra V, Mercer R, Roukos S (1992) Adaptive language modeling using minimum discriminant estimation. Proceedings of international conference on acoustics, speech, and signal processing, pp. 633–636
Rosenfeld R (1996) A maximum entropy approach to adaptive statistical language modeling. J Comput Speech Lang 10(3):187–228
Saraclar M, Bacchiani M (2004) Language model adaptation with MAP estimation and the perceptron algorithm. Proceedings of human language technology conference and meeting of the North American chapter of the association for computational linguistics, pp. 21–24
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system, proceedings of mass storage systems and technologies, pp. 1–10
Web 1T 5-gram Version 1 (2006) http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
White T (2011) Hadoop: The definitive guide, O'Reilly Media, pp. 27–41
Wolberg J (2005) Data analysis using the method of least squares: Extracting the most information from experiments. Springer, pp. 31–64
Acknowledgments
This work was supported by the Industrial Strategic Technology Development Program 10035252, Development of Dialog-based Spontaneous Speech Interface Technology on Mobile Platform funded by the Ministry of Trade, Industry & Energy, Korea.
Additional information
This paper is an extended version of a paper presented at the 3rd International Conference on Intelligent Robotics, Automations, Telecommunication Facilities, and Applications (IRoA 2013).
Cite this article
Kim, KH., Jung, DY., Lee, D. et al. Implementation of a large-scale language model adaptation in a cloud environment. Multimed Tools Appl 75, 5029–5045 (2016). https://doi.org/10.1007/s11042-013-1787-z