Research on Domain Adaptation for SMT Based on Specific Domain Knowledge

  • Conference paper
Machine Translation (CWMT 2016)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 668)

Abstract

In statistical machine translation (SMT), training data usually come from diverse sources, cover multiple themes and genres, and often do not match the domain of the text to be translated, which gives rise to the domain adaptation problem. Existing adaptation methods for SMT are driven by the target text and focus on selecting training data and adjusting translation models, but they do not assign explicit domain labels to texts or data. This study assigns explicit domain labels, using two kinds of specific domain knowledge as examples: (1) domain knowledge based on the Chinese Thesaurus is applied to assign Chinese Library Classification numbers as domain labels to Chinese texts; (2) two-dimensional lexicalized domain knowledge, such as Semantic Category and Application Scenarios, is used to label Japanese sentences. Based on the domain labels obtained for the development and test data, the training data can be filtered to achieve domain consistency. Experiments show that training on only part of the training data achieves translation performance comparable to training on the whole set, which indicates that the method is efficient and feasible.
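The label-based filtering step described in the abstract might be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes every sentence pair already carries a single domain label, and the names `target_domains`, `filter_training_data`, and the `min_share` threshold are hypothetical, as are the example labels.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# (source sentence, target sentence, domain label).
# One label per pair is an assumption made for this sketch.
LabeledPair = Tuple[str, str, str]


def target_domains(labels: Iterable[str], min_share: float = 0.0) -> Set[str]:
    """Return the labels whose share among the dev/test labels is at least min_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return set()
    return {label for label, n in counts.items() if n / total >= min_share}


def filter_training_data(pairs: Iterable[LabeledPair], domains: Set[str]) -> List[LabeledPair]:
    """Keep only the training pairs whose domain label is among the target domains."""
    return [pair for pair in pairs if pair[2] in domains]


# Hypothetical labels standing in for classification numbers.
dev_test_labels = ["TP", "TP", "R"]
training = [
    ("src a", "tgt a", "TP"),
    ("src b", "tgt b", "R"),
    ("src c", "tgt c", "F"),
]

domains = target_domains(dev_test_labels, min_share=0.5)
selected = filter_training_data(training, domains)
```

With `min_share=0.0` every label seen in the development and test data is kept, so the threshold only matters when rare labels should be ignored.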

Notes

  1. http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China (61303152, 71503240) and ISTIC Research Foundation Projects (ZD2016-05).

Author information

Corresponding author

Correspondence to Ying Li.

Copyright information

© 2016 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

He, Y., Ding, L., Li, Y. (2016). Research on Domain Adaptation for SMT Based on Specific Domain Knowledge. In: Yang, M., Liu, S. (eds) Machine Translation. CWMT 2016. Communications in Computer and Information Science, vol 668. Springer, Singapore. https://doi.org/10.1007/978-981-10-3635-4_5

  • DOI: https://doi.org/10.1007/978-981-10-3635-4_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-3634-7

  • Online ISBN: 978-981-10-3635-4

  • eBook Packages: Computer Science, Computer Science (R0)