Enriching Domain-Specific Language Models Using Domain Independent WWW N-Gram Corpus

Chang, Harry

doi:10.1007/978-3-642-29350-4_5

Harry Chang

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7268)

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing

Abstract

This paper describes new techniques for extracting and computing the domain-specific knowledge implicitly embedded in a highly structured, ontology-based information system for a TV Electronic Programming Guide (EPG). The domain knowledge is represented as a set of mutually related n-gram data sets. It is then enriched by exploring the explicit structural dependencies and implicit semantic associations between the data entities in the domain and domain-independent text from the Google 1-trillion-word 5-gram corpus, which was created from general WWW documents. The knowledge-based enrichment process creates the language models required for a natural-language EPG search system. These models outperform a baseline model built only from the original EPG data source, with an absolute improvement of 14.1% in model coverage (recall accuracy) on large-scale test data collected from a real-world EPG search application.
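
To make the idea of enrichment and model coverage concrete, the following is a minimal toy sketch, not the method from the paper. It assumes a simple bigram count model, a hypothetical filter that keeps a web n-gram whenever it contains at least one domain term, and coverage defined as the fraction of test-query n-grams present in the model. All function names and data are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Ngram = Tuple[str, ...]

def ngrams(tokens: List[str], n: int) -> Iterable[Ngram]:
    """Yield every contiguous n-gram in a token list."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def build_model(sentences: Iterable[str], n: int) -> Counter:
    """Count n-grams over whitespace-tokenized, lower-cased sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts

def enrich(model: Counter, web_counts: Counter, domain_vocab: Set[str]) -> Counter:
    """Add web n-grams containing at least one domain term (assumed filter)."""
    enriched = Counter(model)
    for gram, count in web_counts.items():
        if any(token in domain_vocab for token in gram):
            enriched[gram] += count
    return enriched

def coverage(model: Counter, queries: Iterable[str], n: int) -> float:
    """Fraction of test-query n-grams that appear in the model."""
    seen = total = 0
    for query in queries:
        for gram in ngrams(query.lower().split(), n):
            total += 1
            seen += gram in model
    return seen / total if total else 0.0

# Invented toy data: EPG titles, a tiny "web" bigram table, and test queries.
epg_titles = ["the big bang theory", "modern family"]
domain_vocab = {"bang", "theory", "modern", "family"}
web_counts = Counter({
    ("watch", "modern"): 50,
    ("family", "guy"): 30,
    ("stock", "market"): 99,  # no domain term, so it is filtered out
})
queries = ["watch modern family", "family guy tonight"]

baseline = build_model(epg_titles, 2)
enriched = enrich(baseline, web_counts, domain_vocab)
print(coverage(baseline, queries, 2), coverage(enriched, queries, 2))
```

On this toy data the baseline covers one of the four query bigrams and the enriched model covers three. The numbers have no relation to the 14.1% result reported in the abstract, which comes from the paper's own models and test data.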

Author information

Authors and Affiliations

AT&T Labs, Austin, TX, USA
Harry Chang

Editor information

Editors and Affiliations

Częstochowa University of Technology, Armii Krajowej 36, 42-200, Częstochowa, Poland
Leszek Rutkowski, Marcin Korytkowski & Rafał Scherer
AGH University of Science and Technology, Mickiewicza 30, 30-059, Kraków, Poland
Ryszard Tadeusiewicz
Department of Electrical Engineering and Computer Sciences, Computer Science Division, University of California Berkeley, 94720-1776, Berkeley, CA, USA
Lotfi A. Zadeh
Computational Intelligence Laboratory, Electrical and Computer Engineering, University of Louisville, 405 Lutz Hall, 40292, Louisville, KY, USA
Jacek M. Zurada

About this paper

Cite this paper

Chang, H. (2012). Enriching Domain-Specific Language Models Using Domain Independent WWW N-Gram Corpus. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2012. Lecture Notes in Computer Science, vol 7268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29350-4_5

DOI: https://doi.org/10.1007/978-3-642-29350-4_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29349-8
Online ISBN: 978-3-642-29350-4
eBook Packages: Computer Science, Computer Science (R0)
