Abstract
This paper describes new techniques for extracting and computing the domain-specific knowledge implicitly embedded in a highly structured, ontology-based information system for a TV Electronic Programming Guide (EPG). The domain knowledge, represented as a set of mutually related n-gram data sets, is then enriched by exploring the explicit structural dependencies and implicit semantic associations between the data entities in the domain and the domain-independent texts of the Google Web 1T 5-gram corpus, which was built from about one trillion words of general WWW documents. This knowledge-based enrichment process creates the language models required for a natural-language-based EPG search system. The enriched models outperform the baseline model, which is created only from the original EPG data source, by an absolute improvement of 14.1% in model coverage (recall accuracy), measured on large-scale test data collected from a real-world EPG search application.
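To make the coverage measure concrete, the following is a minimal sketch, not the paper's method. It assumes that coverage is the fraction of test-query n-grams found in the model, and it replaces the structural and semantic association measures with a much simpler filter: a general-corpus n-gram is kept if it shares at least one word with the domain vocabulary. All function names and the toy data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> List[NGram]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_model(phrases: Iterable[str], n: int) -> Set[NGram]:
    """Collect the set of n-grams seen in the domain phrases."""
    model: Set[NGram] = set()
    for phrase in phrases:
        model.update(ngrams(phrase.lower().split(), n))
    return model


def enrich(domain_model: Set[NGram], general_ngrams: Iterable[NGram]) -> Set[NGram]:
    """Add general-corpus n-grams that share a word with the domain.

    Assumption: vocabulary overlap is a stand-in for the association
    measures described in the abstract, chosen only for illustration.
    """
    vocab = {tok for gram in domain_model for tok in gram}
    enriched = set(domain_model)
    for gram in general_ngrams:
        if any(tok in vocab for tok in gram):
            enriched.add(gram)
    return enriched


def coverage(model: Set[NGram], queries: Iterable[str], n: int) -> float:
    """Fraction of query n-grams that are present in the model."""
    total = 0
    covered = 0
    for query in queries:
        for gram in ngrams(query.lower().split(), n):
            total += 1
            covered += gram in model
    return covered / total if total else 0.0


# Toy data (hypothetical): EPG titles, a few general-web bigrams, test queries.
epg_phrases = ["star trek", "late night news"]
general_bigrams = [("watch", "star"), ("news", "tonight"), ("stock", "market")]
queries = ["watch star trek", "late night news tonight"]

baseline = build_model(epg_phrases, 2)
enriched = enrich(baseline, general_bigrams)

baseline_cov = coverage(baseline, queries, 2)
enriched_cov = coverage(enriched, queries, 2)
print(f"baseline coverage: {baseline_cov:.2f}")
print(f"enriched coverage: {enriched_cov:.2f}")
print(f"absolute improvement: {enriched_cov - baseline_cov:.2f}")
```

On this toy data the baseline covers three of the five query bigrams and the enriched model covers all five, while the unrelated bigram ("stock", "market") is filtered out. The numbers illustrate the metric only and say nothing about the 14.1% result reported above.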
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chang, H. (2012). Enriching Domain-Specific Language Models Using Domain Independent WWW N-Gram Corpus. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2012. Lecture Notes in Computer Science, vol 7268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29350-4_5
DOI: https://doi.org/10.1007/978-3-642-29350-4_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29349-8
Online ISBN: 978-3-642-29350-4
eBook Packages: Computer Science, Computer Science (R0)