Research on Domain Adaptation for SMT Based on Specific Domain Knowledge

He, Yanqing; Ding, Liang; Li, Ying

doi:10.1007/978-981-10-3635-4_5

Part of the book series: Communications in Computer and Information Science (CCIS, volume 668)

Included in the following conference series:

China Workshop on Machine Translation

Abstract

In statistical machine translation, training data usually come from diverse sources, cover multiple themes and genres, and often do not match the domain of the target text to be translated, which gives rise to the domain adaptation problem. Existing adaptation methods for statistical machine translation are driven by the target text and focus on selecting training data and adjusting translation models, but they do not assign explicit domain labels to texts or data. This study assigns explicit domain labels, using two examples of specific domain knowledge: (1) domain knowledge based on the Chinese Thesaurus is applied to assign Chinese Library Classification numbers as domain labels to Chinese texts; (2) two-dimensional lexicalized domain knowledge, namely Semantic Category and Application Scenario, is used to label Japanese sentences. Based on the domain labels obtained for the development and test data, the training data can be filtered to achieve domain consistency. Experiments show that a subset of the training data can achieve translation performance comparable to that of the whole training data, which indicates that the method is efficient and feasible.
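The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, the label strings, the whitespace tokenization, and the function names `label_sentence`, `target_labels`, and `filter_training_data` are all assumptions made for illustration. The sketch labels each sentence by lexicon lookup, takes the most frequent labels in the development and test data as the target domain, and keeps only training pairs whose source side shares a label with that target.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical toy lexicon mapping terms to domain labels. The study derives
# labels from the Chinese Thesaurus; these entries are placeholders only.
TOY_LEXICON: Dict[str, str] = {
    "translation": "TP",
    "corpus": "TP",
    "protein": "Q",
    "enzyme": "Q",
}


def label_sentence(sentence: str, lexicon: Dict[str, str] = TOY_LEXICON) -> Set[str]:
    """Return the domain labels of all lexicon terms found in the sentence.

    Assumes whitespace tokenization, which is a simplification.
    """
    return {lexicon[tok] for tok in sentence.lower().split() if tok in lexicon}


def target_labels(sentences: Iterable[str], top_k: int = 1) -> Set[str]:
    """Pick the top_k most frequent labels over the dev/test sentences."""
    counts = Counter(label for s in sentences for label in label_sentence(s))
    return {label for label, _ in counts.most_common(top_k)}


def filter_training_data(
    pairs: Iterable[Tuple[str, str]], wanted: Set[str]
) -> List[Tuple[str, str]]:
    """Keep sentence pairs whose source side shares a label with `wanted`."""
    return [(src, tgt) for src, tgt in pairs if label_sentence(src) & wanted]


if __name__ == "__main__":
    dev = ["the translation corpus is small", "protein translation corpus"]
    train = [
        ("a parallel corpus", "tgt-1"),
        ("the enzyme binds", "tgt-2"),
        ("weather today", "tgt-3"),
    ]
    wanted = target_labels(dev)
    print(wanted, filter_training_data(train, wanted))
```

Sentences that match no lexicon entry are discarded in this sketch. A real system would need a policy for unlabeled data, which the abstract does not specify.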

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China (61303152, 71503240) and the ISTIC Research Foundation Projects (ZD2016-05).

Author information

Authors and Affiliations

Institute of Scientific and Technical Information of China, Beijing, 100038, China
Yanqing He, Liang Ding & Ying Li

Corresponding author

Correspondence to Ying Li.

Editor information

Editors and Affiliations

Harbin Institute of Technology, Harbin, China
Muyun Yang
Microsoft Research Asia, Beijing, China
Shujie Liu

About this paper

Cite this paper

He, Y., Ding, L., Li, Y. (2016). Research on Domain Adaptation for SMT Based on Specific Domain Knowledge. In: Yang, M., Liu, S. (eds) Machine Translation. CWMT 2016. Communications in Computer and Information Science, vol 668. Springer, Singapore. https://doi.org/10.1007/978-981-10-3635-4_5

DOI: https://doi.org/10.1007/978-981-10-3635-4_5
Published: 06 January 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3634-7
Online ISBN: 978-981-10-3635-4
eBook Packages: Computer Science, Computer Science (R0)
