DOI: 10.1145/2816839.2816878

On an Empirical Study of Smoothing Techniques for a Tiny Language Model

Published: 23 November 2015

Abstract

Language models (LMs) are an important component in many areas of natural language processing, in particular speech recognition and machine translation. In this experimental work, we present the most popular smoothing methods and their effects on statistical language modelling. We compare the behaviour of twelve smoothing algorithms developed in the speech and natural language processing fields, using a small but novel text corpus of French radio-show transcriptions to build and improve tiny language models. The perplexity (average word branching factor), which measures the performance of our LMs, ranged from 195.9 to 165.4. The best result was obtained by the interpolated version of the modified Kneser-Ney algorithm. The details of the experiments are given. We consider the obtained results good and in agreement with the literature.
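To illustrate the kind of comparison the abstract describes, here is a minimal, self-contained sketch (not the paper's code, and on a toy English corpus rather than the French radio transcriptions) of two of the smoothing families involved — additive (add-k) smoothing and interpolated absolute discounting, a simplified relative of Kneser-Ney — together with the perplexity measure used to rank them. All function names and the toy data are illustrative assumptions.

```python
import math
from collections import Counter

def additive_bigram(tokens, k=1.0):
    """Add-k (additive) smoothed bigram model: add k to every count."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(set(tokens))

    def prob(prev, word):
        return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

    return prob

def absolute_discount_bigram(tokens, d=0.75):
    """Bigram model with absolute discounting, interpolated with the
    unigram distribution (a simplified relative of Kneser-Ney)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(prev for prev, _ in zip(tokens, tokens[1:]))  # c(v)
    followers = {}  # distinct words ever seen after each history
    for prev, word in bigrams:
        followers.setdefault(prev, set()).add(word)
    unigrams = Counter(tokens)
    total = len(tokens)

    def prob(prev, word):
        p_uni = unigrams[word] / total
        if history[prev] == 0:          # unseen history: back off fully
            return p_uni
        # Mass freed by discounting d from each seen bigram is
        # redistributed via the unigram distribution.
        lam = d * len(followers[prev]) / history[prev]
        return max(bigrams[(prev, word)] - d, 0.0) / history[prev] + lam * p_uni

    return prob

def perplexity(prob, tokens):
    """exp of the average negative log-probability per bigram."""
    logs = [-math.log(prob(prev, word)) for prev, word in zip(tokens, tokens[1:])]
    return math.exp(sum(logs) / len(logs))

corpus = "the cat sat on the mat the cat ate".split()
pp_add = perplexity(additive_bigram(corpus), corpus)
pp_abs = perplexity(absolute_discount_bigram(corpus), corpus)
print(f"add-one: {pp_add:.2f}  absolute discounting: {pp_abs:.2f}")
```

Even on this tiny corpus the discounting-based model yields the lower perplexity, mirroring the ranking the paper reports for Kneser-Ney-style methods over simple additive smoothing.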


Published In

IPAC '15: Proceedings of the International Conference on Intelligent Information Processing, Security and Advanced Communication
November 2015
495 pages
ISBN:9781450334587
DOI:10.1145/2816839

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Absolute discounting
  2. Additive smoothing
  3. Kneser-Ney smoothing
  4. Language model
  5. Witten-Bell smoothing
  6. backoff
  7. interpolation
  8. n-gram
  9. perplexity

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IPAC '15

Acceptance Rates

Overall Acceptance Rate 87 of 367 submissions, 24%
