DOI: 10.1145/2816839.2816878

On an Empirical Study of Smoothing Techniques for a Tiny Language Model

Published: 23 November 2015

Abstract

Language models (LMs) are an important component in many areas of natural language processing, in particular speech recognition and machine translation. In this experimental work, we present the most popular smoothing methods and their effects on statistical language modelling. We compare the behaviour of twelve smoothing algorithms developed in the speech and natural language processing fields, using a small but novel text corpus of French radio-show transcriptions to build and improve tiny language models. The perplexity (average word branching factor), which measures the performance of our LMs, ranged from 195.9 to 165.4. The best result was obtained by the interpolated version of the modified Kneser-Ney algorithm. The details of the experiments are given. We consider the obtained results good and in agreement with the literature.
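To illustrate the kind of comparison the abstract describes, here is a minimal, self-contained sketch (not the paper's code, and on a toy English corpus rather than the French radio transcriptions) of two of the smoothing families involved — additive (add-k) smoothing and interpolated absolute discounting, a simplified relative of Kneser-Ney — together with the perplexity measure used to rank them. All function names and the toy data are illustrative assumptions.

```python
import math
from collections import Counter

def additive_bigram(tokens, k=1.0):
    """Add-k (additive) smoothed bigram model: add k to every count."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(set(tokens))

    def prob(prev, word):
        return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

    return prob

def absolute_discount_bigram(tokens, d=0.75):
    """Bigram model with absolute discounting, interpolated with the
    unigram distribution (a simplified relative of Kneser-Ney)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(prev for prev, _ in zip(tokens, tokens[1:]))  # c(v)
    followers = {}  # distinct words ever seen after each history
    for prev, word in bigrams:
        followers.setdefault(prev, set()).add(word)
    unigrams = Counter(tokens)
    total = len(tokens)

    def prob(prev, word):
        p_uni = unigrams[word] / total
        if history[prev] == 0:          # unseen history: back off fully
            return p_uni
        # Mass freed by discounting d from each seen bigram is
        # redistributed via the unigram distribution.
        lam = d * len(followers[prev]) / history[prev]
        return max(bigrams[(prev, word)] - d, 0.0) / history[prev] + lam * p_uni

    return prob

def perplexity(prob, tokens):
    """exp of the average negative log-probability per bigram."""
    logs = [-math.log(prob(prev, word)) for prev, word in zip(tokens, tokens[1:])]
    return math.exp(sum(logs) / len(logs))

corpus = "the cat sat on the mat the cat ate".split()
pp_add = perplexity(additive_bigram(corpus), corpus)
pp_abs = perplexity(absolute_discount_bigram(corpus), corpus)
print(f"add-one: {pp_add:.2f}  absolute discounting: {pp_abs:.2f}")
```

Even on this tiny corpus the discounting-based model yields the lower perplexity, mirroring the ranking the paper reports for Kneser-Ney-style methods over simple additive smoothing.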


Published In

IPAC '15: Proceedings of the International Conference on Intelligent Information Processing, Security and Advanced Communication
November 2015
495 pages
ISBN:9781450334587
DOI:10.1145/2816839

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Absolute discounting
  2. Additive smoothing
  3. Kneser-Ney smoothing
  4. Language model
  5. Witten-Bell smoothing
  6. backoff
  7. interpolation
  8. n-gram
  9. perplexity

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IPAC '15

Acceptance Rates

Overall Acceptance Rate 87 of 367 submissions, 24%
