Enriching Domain-Specific Language Models Using Domain Independent WWW N-Gram Corpus

  • Conference paper
Artificial Intelligence and Soft Computing (ICAISC 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7268)

Abstract

This paper describes new techniques for extracting and computing the domain-specific knowledge implicitly embedded in a highly structured, ontology-based information system for a TV Electronic Programming Guide (EPG). The domain knowledge is represented by a set of mutually related n-gram data sets. It is then enriched by exploring the explicit structural dependencies and implicit semantic associations between the data entities in the domain and domain-independent texts from the Google 1 trillion 5-gram corpus, which was created from general WWW documents. This knowledge-based enrichment process creates the language models required for a natural-language EPG search system. The enriched models outperform the baseline model, which is built only from the original EPG data source, by an absolute improvement of 14.1% in model coverage (recall accuracy) on large-scale test data collected from a real-world EPG search application.
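To make the coverage measure concrete, the following is a minimal sketch of how n-gram coverage of a query set might be compared between a baseline model and an enriched one. It is not the paper's method: the function names `ngrams`, `coverage`, and `enrich`, the toy data, and the enrichment rule (keep a general-corpus n-gram if it contains at least one domain word) are all assumptions made for illustration, whereas the paper's enrichment relies on structural dependencies and semantic associations.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> List[NGram]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def coverage(model: Set[NGram], queries: Iterable[str], n: int) -> float:
    """Fraction of query n-grams that are present in the model."""
    total = 0
    hits = 0
    for query in queries:
        for gram in ngrams(query.lower().split(), n):
            total += 1
            if gram in model:
                hits += 1
    return hits / total if total else 0.0


def enrich(baseline: Set[NGram], general: Set[NGram], vocab: Set[str]) -> Set[NGram]:
    """Hypothetical filter: add general n-grams containing a domain word."""
    return baseline | {g for g in general if any(w in vocab for w in g)}


# Toy data, invented for this sketch only.
baseline = {("star", "trek"), ("movies", "with")}
general = {("watch", "star"), ("tonight", "on"), ("with", "tom")}
vocab = {"star", "trek", "movies"}
queries = ["watch star trek", "movies with tom"]

enriched = enrich(baseline, general, vocab)
base_cov = coverage(baseline, queries, 2)   # 2 of 4 bigrams covered -> 0.5
rich_cov = coverage(enriched, queries, 2)   # 3 of 4 bigrams covered -> 0.75
print(f"baseline={base_cov:.2f} enriched={rich_cov:.2f} gain={rich_cov - base_cov:.2f}")
```

On this toy data the absolute coverage gain is 0.25; the 14.1% figure reported in the abstract refers to the paper's own models and test set, not to this sketch.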

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chang, H. (2012). Enriching Domain-Specific Language Models Using Domain Independent WWW N-Gram Corpus. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2012. Lecture Notes in Computer Science, vol 7268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29350-4_5

  • DOI: https://doi.org/10.1007/978-3-642-29350-4_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29349-8

  • Online ISBN: 978-3-642-29350-4

  • eBook Packages: Computer Science, Computer Science (R0)
