Slavic languages in phrase-based statistical machine translation: a survey

Artificial Intelligence Review

Abstract

The demand for translations is increasing at a rate far beyond the capacity of professional translators. It is too difficult, time-consuming and expensive to translate everything from scratch in each language. Machine translation offers a solution, as it provides translation automatically. Until recently, statistical machine translation had proved to be one of the most successful approaches. However, a new approach to machine translation based on neural networks has emerged with promising results. The present paper concerns phrase-based statistical machine translation, an area that has been extensively studied in the literature. The translation system consists of many components built on the premise of probabilities. Each component is described separately. Although high-quality translation systems have been developed for certain language pairs, a large number of languages still cause many translation errors. Languages with a rich morphology pose an especially difficult challenge for research. We address one group of morphologically rich languages: Slavic languages, which constitute a relatively homogeneous family of languages characterized by rich, inflectional morphology. The present paper offers a comprehensive survey of approaches to coping with Slavic languages in different aspects of statistical machine translation. We observe that the community's interest in research on more difficult languages is increasing, and we believe that the translation quality for those languages will reach the level of practical use in the near future.

Notes

  1. http://www.meta-net.eu/whitepapers/key-results-and-cross-language-comparison.

References

  • Agić Ž, Merkler D, Berović D (2013) Parsing Croatian and Serbian by using Croatian dependency treebanks. In: Proceedings of the fourth workshop on statistical parsing of morphologically-rich languages. Seattle, Washington, USA, pp 22–33

  • Alumäe T, Kurimo M (2010) Efficient estimation of maximum entropy language models with N-gram features: an SRILM extension. In: Proceedings of Interspeech 2010. Chiba, Japan, pp 1820–1823

  • Arčan M, Popović M, Buitelaar P (2016) Asistent: a machine translation system for Slovene, Serbian and Croatian. In: Proceedings of the conference on language technologies & digital humanities. Ljubljana, Slovenia, pp 13–20

  • Avramidis E, Koehn P (2008) Enriching morphologically poor languages for statistical machine translation. In: Proceedings of ACL-08: HLT. Association for Computational Linguistics, Columbus, Ohio, pp 763–770

  • Baerman M (2015) The Oxford handbook of inflection. Oxford University Press, Oxford

  • Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473

  • Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL 2005 workshop on intrinsic and extrinsic evaluation measures for MT and/or summarization, pp 65–72

  • Bertoldi N, Haddow B, Fouet JB (2010) Improved minimum error rate training in Moses. Prague Bull Math Linguist 91:7–16

  • Bilmes JA, Kirchhoff K (2003) Factored language models and generalized parallel backoff. In: Proceedings of the 2003 conference of the North American chapter of the association for computational linguistics on human language technology: companion volume of the proceedings of HLT-NAACL 2003-short papers, vol 2. Association for Computational Linguistics, Edmonton, Canada, pp 4–6

  • Bisazza A, Monz C (2014) Class-based language modeling for translating into morphologically rich languages. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers. Dublin City University and Association for Computational Linguistics, Dublin, Ireland, pp 1918–1927

  • Bohnet B, Nivre J, Boguslavsky IM, Farkas R, Ginter F, Hajič J (2013) Joint morphological and syntactic analysis for richly inflected languages. Trans Assoc Comput Linguist 1:429–440

  • Bojar O (2007) English-to-Czech factored machine translation. In: Proceedings of the second workshop on statistical machine translation. Prague, Czech Republic, Association for Computational Linguistics, pp 232–239

  • Bojar O (2011) Analyzing error types in English-Czech machine translation. Prague Bull Math Linguist 95:63–76

  • Bojar O, Čmejrek M (2007) Mathematical model of tree transformations. Public deliverable D3.2, EuroMatrix, IST-034291

  • Bojar O, Hajič J (2008) Phrase-based and deep syntactic English-to-Czech statistical machine translation. In: Proceedings of the third workshop on statistical machine translation. Association for Computational Linguistics, Columbus, Ohio, USA, pp 143–146

  • Bojar O, Kos K (2010) 2010 Failures in English-Czech phrase-based MT. In: Proceedings of the joint fifth workshop on statistical machine translation and metrics (MATR). Association for Computational Linguistics, Uppsala, Sweden, pp 60–66

  • Bojar O, Prokopová M (2006) Czech-English word alignment. In: Proceedings of the international conference on language resources and evaluation, pp 1236–1239

  • Bojar O, Tamchyna A (2011) Forms wanted: training SMT on monolingual data. Abstract at machine translation and morphologically-rich languages. In: Research workshop of the Israel Science Foundation, University of Haifa, Israel

  • Bojar O, Wu D (2012) Towards a predicate-argument evaluation for MT. In: Proceedings of the sixth workshop on syntax, semantics and structure in statistical translation (SSST). Jeju, Republic of Korea, Association for Computational Linguistics, pp 30–38

  • Bojar O, Zeman D (2014) Czech machine translation in the project CzechMATE. Prague Bull Math Linguist 101:71–96

  • Bojar O, Matusov E, Ney H (2006) Czech-English phrase-based machine translation. In: Proceedings of the 5th international conference on NLP (FinTAL). Turku, Finland, pp 214–224

  • Bojar O, Kos K, Mareček D (2010) Tackling sparse data issue in machine translation evaluation. In: Proceedings of the ACL 2010 conference short papers. Association for Computational Linguistics, Uppsala, Sweden, pp 86–91

  • Bojar O, Jawaid B, Kamran A (2012) Probes in a taxonomy of factored phrase-based models. In: Proceedings of the 7th workshop on statistical machine translation. Association for Computational Linguistics, Montréal, Canada, pp 253–260

  • Bojar O, Macháček M, Tamchyna A, Zeman D (2013a) Scratching the surface of possible translations. In: Proceedings of the 16th international conference on text, speech and dialogue. Plzeň, Czech Republic, pp 465–474

  • Bojar O, Rosa R, Tamchyna A (2013b) Chimera—three heads for English-to-Czech translation. In: Proceedings of the eighth workshop on statistical machine translation. Association for Computational Linguistics, Sofia, Bulgaria, pp 92–98

  • Bojar O, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Jimeno Yepes A, Koehn P, Logacheva V, Monz C, Negri M, Neveol A, Neves M, Popel M, Post M, Rubino R, Scarton C, Specia L, Turchi M, Verspoor K, Zampieri M (2016) Findings of the 2016 conference on machine translation. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 131–198

  • Botha JA, Blunsom P (2014) Compositional morphology for word representations and language modelling. In: Proceedings of the 31st international conference on machine learning. Beijing, China, pp 1899–1907

  • Brown PF, Pietra SAD, Pietra VJD, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19(2):263–311

  • Brychcín T, Konopík M (2011) Morphological based language models for inflectional languages. In: The 6th IEEE international conference on intelligent data acquisition and advanced computing systems: technology and applications. Prague, Czech Republic, pp 560–564

  • Brychcín T, Konopík M (2015) HPS: high precision stemmer. Inf Process Manag 51(1):68–91

  • Burlot F, Yvon F (2015) Morphology-aware alignments for translation to and from a synthetic language. In: Proceedings of the 12th international workshop on spoken language translation, Da Nang, Vietnam, pp 188–195

  • Cettolo M, Niehues J, Stüker S, Bentivogli L, Cattoni R, Federico M (2015) The IWSLT 2015 evaluation campaign. In: Proceedings of the international workshop on spoken language translation (IWSLT), Da Nang, Vietnam, pp 2–14

  • Chahuneau V, Schlinger E, Smith NA, Dyer C (2013a) Translating into morphologically rich languages with synthetic phrases. In: Proceedings of the 2013 conference on empirical methods in natural language processing. Seattle, Washington, USA, pp 1677–1687

  • Chahuneau V, Smith NA, Dyer C (2013b) Knowledge-rich morphological priors for Bayesian language models. In: Proceedings of NAACL-HLT. Atlanta, Georgia, pp 1206–1215

  • Chen SF, Goodman J (1998) An empirical study of smoothing techniques for language modelling. Technical Report TR-10-98, Computer Science Group, Harvard University

  • Cho K, Van Merriënboer B, Gülçehre Ç, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN Encoder-Decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1724–1734

  • Cholakov K, Kordoni V (2014) Better statistical machine translation through linguistic treatment of phrasal verbs. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 196–201

  • Chung J, Cho K, Bengio Y (2016) NYU-MILA neural machine translation systems for WMT16. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 268–271

  • Costa-jussà MR (2015a) How much hybridization does machine translation need? J Assoc Inf Sci Technol 66(10):2160–2165

  • Costa-jussà MR (2015b) Latest trends in hybrid machine translation and its applications. Comput Speech Lang 32(1):3–10

  • Denkowski M, Lavie A (2014) Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the EACL 2014 workshop on statistical machine translation. Baltimore, Maryland, USA, pp 376–380

  • Ding S, Duh K, Khayrallah H, Koehn P, Post M (2016) The JHU machine translation systems for WMT 2016. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 272–280

  • Donaj G, Kačič Z (2016) Language modeling for automatic speech recognition of inflective languages: an applications-oriented approach using lexical data. Springer, London

  • Dove C, Loskutova O, de la Fuente R (2012) What’s your pick: RbMT, SMT or hybrid? In: Proceedings of 11th conference of the association for machine translation in the Americas (AMTA), San Diego, CA

  • Dugonik J, Bošković B, Maučec MS, Brest J (2014) The usage of differential evolution in a statistical machine translation. In: Proceedings of the IEEE symposium series on computational intelligence (SSCI). Orlando, Florida, USA, pp 89–96

  • Durrani N, Sajjad H (2014) Integrating an unsupervised transliteration model into statistical machine translation. In: Proceedings of the 14th conference of the European chapter of the association for computational linguistics. Gothenburg, Sweden, pp 148–153

  • Durrani N, Schmid H, Fraser A (2011) A joint sequence translation model with integrated reordering. In: Proceedings of the 49th annual meeting of the association for computational linguistics (ACL-HLT). Portland, Oregon, USA, pp 1045–1054

  • Durrani N, Fraser A, Schmid H, Hoang H, Koehn P (2013) Can Markov models over minimal translation units help phrase-based SMT? In: Proceedings of the 51st annual conference of the association for computational linguistics (ACL). Sofia, Bulgaria, pp 399–405

  • Durrani N, Koehn P, Schmid H, Fraser A (2014) Investigating the usefulness of generalized word representations in SMT. In: Proceedings of the 25th annual conference on computational linguistics (COLING). Dublin, Ireland, pp 421–432

  • Durrani N, Schmid H, Fraser A, Koehn P, Schütze H (2015) The operation sequence model—combining N-gram-based and phrase-based statistical machine translation. Comput Linguist 41(2):185–214

  • Dušek O, Žabokrtský Z, Popel M, Dušek M, Novák M, Mareček D (2012) Formemes in English-Czech deep syntactic MT. In: Proceedings of the 7th workshop on statistical machine translation. Association for Computational Linguistics, Montreal, Canada, pp 267–274

  • Dyer C, Chahuneau V, Smith NA (2013) A simple, fast, and effective reparameterization of IBM Model 2. In: Proceedings of NAACL. Atlanta, Georgia, USA, pp 644–648

  • Dzikiene JK, Nivre J, Krupavičius A (2013) Lithuanian dependency parsing with rich morphological features. In: Proceedings of the fourth workshop on statistical parsing of morphologically-rich languages, pp 12–21

  • Eisele A, Federmann C, Saint-Amand H, Jellinghaus M, Herrmann T, Chen Y (2008) Using Moses to integrate multiple rule-based machine translation engines into a hybrid system. In: Proceedings of the third workshop on statistical machine translation. Association for Computational Linguistics, Columbus, Ohio, USA, pp 179–182

  • Farrús M, Costa-jussà MR, Morse MP (2012) Study and correlation analysis of linguistic, perceptual, and automatic machine translation evaluations. J Am Soc Inf Sci Technol 63(1):174–184

  • Federmann C, Hunsicker S (2011) Stochastic parse tree selection for an existing RBMT system. In: Proceedings of the 6th workshop on statistical machine translation. Association for Computational Linguistics, Edinburgh, Scotland, pp 351–357

  • Felice M, Specia L (2013) Investigating the contribution of linguistic information to quality estimation. Mach Transl 27:193–212

  • Fishel M (2009) Deeper than words: morph-based alignment for statistical machine translation. In: Proceedings of the conference of the pacific association for computational linguistics (PacLing 2009), University of Hokkaido, Sapporo, Japan

  • Galuščáková P, Bojar O (2012) Improving SMT by using parallel data of a closely related language. In: Human Language Technologies—the Baltic Perspective—proceedings of the fifth international conference Baltic HLT 2012, IOS Press, Amsterdam, Netherlands, Frontiers in AI and Applications, vol 247, pp 58–65

  • Gao J, He X, Yih W, Deng L (2014) Learning continuous phrase representations for translation modeling. In: Proceedings of the 52nd annual meeting of the association for computational linguistics. Association for Computational Linguistics, pp 699–709

  • Gao Q, Vogel S (2008) Parallel implementations of word alignment tool. In: Proceedings of the workshop software engineering, testing, and quality assurance for natural language processing. Association for Computational Linguistics, pp 49–57

  • Gaudio R, Labaka G, Agirre E, Osenova P, Simov K, Popel M, Oele D, van Noord G, Gomes L, Rodrigues JA, Neale S, Silva J, Querido A, Rendeiro N, Branco A (2016) SMT and hybrid systems of the QTLeap project in the WMT16 IT-task. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 435–441

  • Genzel D (2010) Automatically learning source-side reordering rules for large scale machine translation. In: Proceedings of the 23rd international conference on computational linguistics. Association for Computational Linguistics, pp 376–384

  • Giménez J, Màrquez L (2010) Linguistic measures for automatic machine translation evaluation. Mach Transl 24:209–240

  • Gimpel K, Smith NA (2014) Phrase dependency machine translation with quasi-synchronous tree-to-tree features. Comput Linguist 40(2):349–401

  • Goldwater S, McClosky D (2005) Improving statistical MT through morphological analysis. In: Proceedings of human language technology conference and conference on empirical methods in natural language processing (HLT/EMNLP). Vancouver, Canada, pp 676–683

  • Graham Y, van Genabith J (2010) Factor templates for factored machine translation models. In: Proceedings of the seventh international workshop on spoken language translation (IWSLT). Paris, France, pp 275–282

  • Green N (2011) Effects of noun phrase bracketing in dependency parsing and machine translation. In: Proceedings of the ACL 2011 student session. Association for Computational Linguistics, Portland, OR, USA, pp 69–74

  • Green S, DeNero J (2012) A class-based agreement model for generating accurately inflected translations. In: Proceedings of the 50th annual meeting of the association for computational linguistics. Jeju, Republic of Korea, Association for Computational Linguistics, pp 146–155

  • Hammarström H, Borin L (2011) Unsupervised learning of morphology. Comput Linguist 37(2):309–350

  • Hirsimäki T, Pylkkönen J, Kurimo M (2009) Importance of high-order N-gram models in morph-based speech recognition. IEEE/ACM Trans Audio Speech Lang Process 17(4):724–732

  • Ho C, Azmi Murad MA, Doraisamy S, Abdul Kadir R (2014) Extracting lexical and phrasal paraphrases: a review of the literature. Artif Intell Rev 42(4):851–894

  • Hoang C, Sima’an K (2014) Latent domain translation models in mix-of-domains haystack. In: COLING 2014, 25th international conference on computational linguistics, proceedings of the conference: technical papers, August 23–29, 2014. Dublin, Ireland, pp 1928–1939

  • Hoang T, Bojar O (2015) TmTriangulate: a tool for phrase table triangulation. Prague Bull Math Linguist 104:75–86

  • Homola P, Kuboň V (2008) A hybrid machine translation system for typologically related languages. In: Proceedings of the 21st international Florida Artificial Intelligence Research Society conference (FLAIRS), pp 227–228

  • Huet S, Manishina E, Lefevre F (2013) Factored machine translation systems for Russian-English. In: Proceedings of the eighth workshop on statistical machine translation. Sofia, Bulgaria, pp 154–157

  • Hunsicker S, Yu C, Federmann C (2012) Machine learning for hybrid machine translation. In: Proceedings of the seventh workshop on statistical machine translation, pp 312–316

  • Ircing P, Psutka JV, Psutka J (2009) Using morphological information for robust language modeling in Czech ASR system. IEEE/ACM Trans Audio Speech Lang Process 17(4):840–847

  • Ircing P, Krbec P, Hajič J, Khudanpur S, Jelinek F, Psutka J, Byrne W (2001) On large vocabulary continuous speech recognition of highly inflectional language—Czech. In: Proceedings of the European conference on speech communication and technology (EUROSPEECH), pp 487–490

  • ISO 9:1995 (1995) Information and documentation: transliteration of Cyrillic characters into Latin characters: Slavic and non-Slavic languages. International Organization for Standardization

  • Jawaid B, Bojar O (2014) Two-step machine translation with lattices. In: Proceedings of the 9th international conference on language resources and evaluation (LREC 2014). Reykjavík, Iceland, pp 682–686

  • Jean S, Firat O, Cho K, Memisevic R, Bengio Y (2015) Montreal neural machine translation systems for WMT’15. In: Proceedings of the tenth workshop on statistical machine translation. Lisboa, Portugal, pp 134–140

  • Jeong M, Toutanova K, Suzuki H, Quirk C (2010) A discriminative lexicon model for complex morphology. In: The ninth conference of the association for machine translation in the Americas (AMTA). Association for Computational Linguistics

  • Joty S, Guzmán F, Màrquez L, Nakov P (2014) DiscoTK: using discourse structure for machine translation evaluation. In: Proceedings of the ninth workshop on statistical machine translation. Association for Computational Linguistics, Baltimore, Maryland, USA, pp 402–408

  • Juhár J, Staš J, Hládek D (2012) Recent progress in development of language model for Slovak large vocabulary continuous speech recognition. In: New technologies-trends, innovations and research, pp 261–276

  • Junczys-Dowmunt M, Szał A (2011) SyMGiza++: Symmetrized word alignment models for statistical machine translation. In: International joint conferences security and intelligent information systems (SIIS), pp 379–390

  • Junczys-Dowmunt M, Dwojak T, Sennrich R (2016) The AMU-UEDIN submission to the WMT16 news translation task: attention-based NMT models as feature functions in phrase-based SMT. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 319–325

  • Kalchbrenner N, Blunsom P (2013) Recurrent continuous translation models. In: Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP), pp 1700–1709

  • Katz SM (1987) Estimation of probabilities from sparse data for the language model component of a speech recogniser. IEEE Trans Acoust Speech Signal Process 35(3):400–401

  • Kazi M, Salesky E, Thompson B, Ray J, Coury M, Shen W, Anderson T, Erdmann G, Gwinnup J, Young K, Ore B, Hutt M (2014) The MITLL-AFRL IWSLT 2014 MT System. In: Proceedings of the international workshop on spoken language translation (IWSLT), Lake Tahoe, pp 65–73

  • Kipyatkova I, Karpov A (2014) Study of morphological factors of factored language models for Russian ASR. In: Proceedings of the 16th international conference speech and computer (SPECOM). Novi Sad, Serbia, pp 451–458

  • Kirchhoff K, Yang M, Duh K (2006) Machine translation of parliamentary proceedings using morpho-syntactic knowledge. In: Proceedings of the TC-STAR workshop on speech-to-speech translation

  • Kneser R, Ney H (1993) Improved clustering techniques for class-based statistical language modelling. In: Proceedings of third European conference on speech communication and technology. EUROSPEECH 1993, Berlin, Germany, pp 22–25

  • Koehn P (2011) Statistical machine translation. Cambridge University Press, Cambridge

  • Koehn P, Haddow B (2012) Interpolated backoff for factored translation models. In: Proceedings of the tenth conference of the association for machine translation in the Americas (AMTA)

  • Koehn P, Hoang H (2007) Factored translation models. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL). Prague, Czech Republic, pp 868–876

  • Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: Proceedings of the human language technology and North American association for computational linguistics conference (HLT/NAACL). Edmonton, Canada, pp 48–54

  • Kolovratník D, Klyueva N, Bojar O (2009) Statistical machine translation between related and unrelated languages. In: ITAT 2009 information technologies—applications and theory, Slovakia, pp 31–36

  • Kos K, Bojar O (2009) Evaluation of machine translation metrics for Czech as the target language. Prague Bull Math Linguist 92:135–147

  • Kuboň V, Vičič J (2014) A comparison of MT methods for closely related languages: a case study on Czech-Slovak language pair. In: Proceedings of the conference language technology for closely related languages and language variants (LT4CloseLang), pp 92–98

  • Labaka G, España-Bonet C, Màrquez L, Sarasola K (2014) A hybrid machine translation architecture guided by syntax. Mach Transl 28(2):91–125

  • Lembersky G, Ordan N, Wintner S (2012) Language models for machine translation: original vs. translated texts. Comput Linguist 38(4):799–825

  • Lerner U, Petrov S (2013) Source-side classifier preordering for machine translation. In: Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP ’13). Seattle, Washington, USA, pp 513–523

  • Libovický J, Pecina P (2015) Tolerant BLEU: a submission to the WMT14 metrics task. In: Proceedings of the ninth workshop on statistical machine translation (SMT), pp 409–413

  • Lo C, Cherry C, Foster G, Stewart D, Islam R, Kazantseva A, Kuhn R (2016) NRC Russian-English machine translation system for WMT 2016. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 326–332

  • Luong MT, Socher R, Manning CD (2013) Better word representations with recursive neural networks for morphology. In: Proceedings of the seventeenth conference on computational natural language learning. Association for Computational Linguistics, Sofia, Bulgaria, pp 104–113

  • Macherey K, Dai AM, Talbot D, Popat AC, Och F (2011) Language-independent compound splitting with morphological operations. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, vol 1. Association for Computational Linguistics, Portland, Oregon, HLT ’11, pp 1395–1404

  • Majewski P (2008) Syllable based language model for large vocabulary continuous speech recognition of Polish. In: Proceedings of the 11th international conference text, speech and dialogue (TSD). Brno, Czech Republic, pp 397–401

  • Marasek K (2012) TED Polish-to-English translation system for the IWSLT 2012. In: Proceedings of the international workshop on spoken language translation (IWSLT), Hong Kong, pp 126–129

  • Mareček D, Rosa R, Galuščáková P, Bojar O (2011) Two-step translation with grammatical post-processing. In: Proceedings of the sixth workshop on statistical machine translation. Association for Computational Linguistics, Edinburgh, Scotland, WMT ’11, pp 426–432

  • Mariño JB, Banchs RE, Crego JM, de Gispert A, Lambert P, Fonollosa JAR, Costa-jussà MR (2006) N-gram-based machine translation. Comput Linguist 32(4):527–549

  • Maučec MS, Brest J (2010) Reduction of morpho-syntactic features in statistical machine translation of highly inflective language. Informatica 21(1):95–116

  • Maučec MS, Donaj G (2016) Morphosyntactic tags in statistical machine translation of highly inflectional language. In: Proceedings of the artificial intelligence and natural language conference (AINL FRUCT). Saint-Petersburg, Russia, pp 99–102

  • Maučec MS, Kačič Z, Verdonik D (2014) Statistical machine translation of subtitles for highly inflected language pair. Pattern Recogn Lett 46:96–103

  • McDonald R, Nivre J (2011) Analyzing and integrating dependency parsers. Comput Linguist 37(1):197–230

  • Mikolov T, Kopecký J, Burget L, Glembek O, Černocký JH (2009) Neural network based language models for highly inflected languages. In: Proceedings of the ICASSP, pp 4725–4728

  • Mikolov T, Yih W, Zweig G (2013) Linguistic regularities in continuous space word representations. In: Proceedings of the conference of the North American chapter of the association for computational linguistics: human language technologies (NAACL HLT). Atlanta, Georgia, pp 746–751

  • Miłkowski M (2012) The Polish language in the digital age, White Paper Series. Springer, Berlin

  • Minkov E, Toutanova K, Suzuki H (2007) Generating complex morphology for machine translation. In: Proceedings of the 45th annual meeting of the association of computational linguistics. Association for Computational Linguistics, Prague, Czech Republic, pp 128–135

  • Molchanov A, Bykov F (2016) PROMT translation systems for WMT 2016 translation tasks. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 339–343

  • Morchid M, Huet S, Dufour R (2014) Topic-based approach for post-processing correction of automatic translations. In: Proceedings of the 11th international workshop on spoken language translation, Lake Tahoe, pp 80–85

  • Müller T, Schuetze H, Schmid H (2012) A comparative investigation of morphological language modeling for the languages of the European Union. In: Human language technologies: conference of the North American chapter of the association of computational linguistics, proceedings, June 3–8, 2012. Montréal, Canada, pp 386–395

  • Munková D, Munk M (2014) An automatic evaluation of machine translation and Slavic languages. In: Proceedings of the 8th international conference on application of information and communication technologies (AICT-2014), Astana, pp 447–451

  • Munková D, Munk M (2015) Automatic evaluation of machine translation through the residual analysis. In: Proceedings of the 11th international conference advanced intelligent computing theories and applications. Fuzhou, China, pp 481–490

  • Niehues J, Herrmann T, Vogel S, Waibel A (2011) Wider context by using bilingual language models in machine translation. In: Proceedings of the sixth workshop on statistical machine translation. Association for Computational Linguistics, Edinburgh, Scotland, pp 198–206

  • Nivre J (2015) Towards a universal grammar for natural language processing. In: Gelbukh A (ed) Computational linguistics and intelligent text processing. Springer, Berlin, pp 3–16

  • Nivre J, Hall J, Nilsson J, Chanev A, Eryiğit G, Kübler S, Marinov S, Marsi E (2007) MaltParser: a language-independent system for data-driven dependency parsing. Nat Lang Eng 13(2):95–135

  • Novák V, Žabokrtský Z (2007) Feature engineering in maximum spanning tree dependency parser. In: Proceedings of the 10th international conference on text, speech and dialogue. Pilsen, Czech Republic, pp 92–98

  • Novák V, Nedoluzhko A, Žabokrtský Z (2013) Translation of “it” in a deep syntax framework. In: Proceedings of the workshop on discourse in machine translation (DiscoMT). Association for Computational Linguistics, Sofia, Bulgaria, pp 51–59

  • Och FJ (2003) Minimum error rate training in statistical machine translation. In: Proceedings of the 41st annual meeting on association for computational linguistics, vol 1. Association for Computational Linguistics, Sapporo, Japan, pp 160–167

  • Och FJ, Ney H (2003a) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51

  • Och FJ, Ney H (2003b) The alignment template approach to statistical machine translation. Comput Linguist 30(4):417–449

  • Oparin I (2008) Language models for automatic speech recognition of inflectional languages. Ph.D. Dissertation, University of West Bohemia

  • Oparin I, Glembek O, Burget L, Černocký J (2008) Morphological random forests for language modeling of inflectional languages. In: Proceedings of the IEEE spoken language technology workshop. Goa, India, pp 189–192

  • Papineni K, Roukos S, Ward T, Zhu WJ (2004) BLEU: a method for automatic evaluation of machine translation. Tech. Rep. RC22176(W0109-022), IBM Research Report, IBM

  • Popel M, Žabokrtský Z (2010) TectoMT: Modular NLP framework. In: Proceedings of the 7th international conference on advances in natural language processing, Reykjavik, Iceland, IceTAL’10, pp 293–304

  • Popel M, Mareček D, Green N, Žabokrtský Z (2011) Influence of parser choice on dependency-based MT. In: Proceedings of the 6th workshop on statistical machine translation. Association for Computational Linguistics, Edinburgh, Scotland, UK, pp 433–439

  • Popović M (2011) Hjerson: an open source tool for automatic error classification of machine translation output. Prague Bull Math Linguist 96:59–68

  • Popović M (2015) chrF: character n-gram F-score for automatic MT evaluation. In: Proceedings of the tenth workshop on statistical machine translation. Association for Computational Linguistics, Lisbon, Portugal, pp 392–395

  • Popović M, Arčan M (2015) Identifying main obstacles for statistical machine translation of morphologically rich South Slavic languages. In: Proceedings of the eighteenth annual conference of the European association for machine translation (EAMT 15). Antalya, Turkey, pp 97–104

  • Popović M, Ljubešić N (2014) Exploring cross-language statistical machine translation for closely related South Slavic languages. In: Proceedings of the conference: language technology for closely related languages and language variants (LT4CloseLang). Association for Computational Linguistics, Doha, Qatar, pp 76–84

  • Popović M, Ney H (2004) Towards the use of word stems and suffixes for statistical machine translation. In: Proceedings of the 4th international conference on language resources and evaluation (LREC), Lisbon, Portugal, pp 1585–1588

  • Popović M, Ney H (2011) Towards automatic error analysis of machine translation output. Comput Linguist 37(4):657–688

  • Popović M, Arčan M, Avramidis E, Burchardt A, Lommel AR (2015) Poor man’s lemmatisation for automatic error classification. In: The eighteenth annual conference of the European association for machine translation (EAMT 15), pp 105–112

  • Prochazka V, Pollak P, Zdansky J, Nouza J (2011) Performance of Czech speech recognition with language models created from public resources. Radioengineering 20(4):1002–1008

  • Rishøj C, Søgaard A (2011) Factored translation with unsupervised word clusters. In: Proceedings of the 6th workshop on statistical machine translation. Association for Computational Linguistics, Edinburgh, Scotland, pp 447–451

  • Rosa R, Mareček D, Dušek O (2012) DEPFIX: a system for automatic correction of Czech MT outputs. In: Proceedings of the seventh workshop on statistical machine translation. Association for Computational Linguistics, Montreal, Canada, WMT ’12, pp 362–368

  • Rosa R, Sudarikov R, Novák M, Popel M, Bojar O (2016) Dictionary-based domain adaptation of MT systems without retraining. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 449–455

  • Rotovnik T, Maučec MS, Kačič Z (2007) Large vocabulary continuous speech recognition of an inflected language using stems and endings. Speech Commun 49(6):437–452

  • Ruth J, O’Regan J (2011) Shallow-transfer rule-based machine translation from Czech to Polish. In: Proceedings of the second international workshop on free/open-source rule-based machine translation, pp 69–76

  • Salehi B, Cook P, Baldwin T (2014) Using distributional similarity of multi-way translations to predict multiword expression compositionality. In: Proceedings of the 14th conference of the European chapter of the association for computational linguistics. Association for Computational Linguistics, Gothenburg, Sweden, pp 472–481

  • Schwenk H, Rousseau A, Attik M (2012) Large, pruned or continuous space language models on a GPU for statistical machine translation. In: Proceedings of the conference of the North American chapter of the association for computational linguistics: human language technologies (NAACL HLT). Atlanta, Georgia, pp 11–19

  • Seeker W, Kuhn J (2013) Morphological and syntactic case in statistical dependency parsing. Comput Linguist 39:23–55

  • Sennrich R (2015) Modelling and optimizing on syntactic N-grams for statistical machine translation. Trans Assoc Computat Linguist 3:169–182

  • Sennrich R, Haddow B, Birch A (2016a) Edinburgh neural machine translation systems for WMT 16. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, pp 371–376

  • Sennrich R, Haddow B, Birch A (2016b) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics, pp 1715–1725

  • Shaik MAB, Mousa AED, Schlüter R, Ney H (2011) Using morpheme and syllable based sub-words for Polish LVCSR. In: Proceedings of ICASSP, pp 4680–4683

  • Shalonova K, Golénia B, Flach P (2009) Towards learning morphology for under-resourced fusional and agglutinating languages. IEEE/ACM Trans Audio Speech Lang Process 17(5):956–965

  • Shin E, Stüker S, Kilgour K, Fügen C, Waibel A (2013) Maximum entropy language modeling for Russian ASR. In: Proceedings of the 10th international workshop on spoken language translation, Heidelberg, Germany, pp 288–294

  • Simova I, Kordoni V (2013) Improving English-Bulgarian statistical machine translation by phrasal verb treatment. In: Workshop on multi-word units in machine translation and translation technologies, pp 62–71

  • Slawik I, Niehues J, Waibel A (2015) Stripping adjectives: integration techniques for selective stemming in SMT systems. In: The eighteenth annual conference of the European association for machine translation (EAMT 15), pp 105–112

  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation error rate with targeted human annotation. In: 5th conference of the association for machine translation in the Americas (AMTA), Boston, Massachusetts

  • Son LH, Allauzen A, Yvon F (2012) Continuous space translation models with neural networks. In: Proceedings of the conference of the North American chapter of the association for computational linguistics: human language technologies, pp 39–48

  • Stanojević M, Sima’an K (2014) BEER: BEtter evaluation as ranking. In: Proceedings of the ninth workshop on statistical machine translation. Association for Computational Linguistics, Baltimore, Maryland, USA, pp 414–419

  • Tamchyna A, Bojar O (2015) What a transfer-based system brings to the combination with PBMT. In: Proceedings of the ACL 2015 fourth workshop on hybrid approaches to translation (HyTra). Association for Computational Linguistics, Beijing, China, pp 11–20

  • Tiedemann J (2012) Character-based pivot translation for under-resourced languages and domains. In: Proceedings of the 13th conference of the European chapter of the association for computational linguistics (EACL 2012), The Association for Computational Linguistics, pp 141–151

  • Tiedemann J, Agić Ž, Nivre J (2014) Treebank translation for cross-lingual parser induction. In: Proceedings of the eighteenth conference on computational natural language learning (CoNLL). Avignon, France, pp 130–140

  • Tillmann C (2004) A unigram orientation model for statistical machine translation. In: Proceedings of HLT-NAACL 2004: short papers. Association for Computational Linguistics, Boston, Massachusetts, pp 101–104

  • Tillmann C, Hewavitharana S (2013) A unified alignment algorithm for bilingual data. Nat Lang Eng 19(1):33–60

  • Toral A, Pecina P, Wang L, van Genabith J (2015) Linguistically-augmented perplexity-based data selection for language models. Comput Speech Lang 32:11–26

  • Toutanova K, Suzuki H, Ruopp A (2008) Applying morphology generation models to machine translation. Proc ACL. Association for Computational Linguistics, Columbus, pp 514–522

  • Tran K, Bisazza A, Monz C (2014) Word translation prediction for morphologically rich languages with bilingual neural networks. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1676–1688

  • Tsvetkov Y, Dyer C, Levin L, Bhatia A (2013) Generating English determiners in phrase-based translation with synthetic translation options. In: Proceedings of the eighth workshop on statistical machine translation. Sofia, Bulgaria, pp 271–280

  • Vaswani A, Huang L, Chiang D (2012) Smaller alignment models for better translations: unsupervised word alignment with the l0-norm. In: Proceedings of the 50th annual meeting of the association for computational linguistics, pp 311–319

  • Vazhenina D, Markov K (2013) Factored language modeling for Russian LVCSR. In: Proceedings of the international joint conference on awareness science and technology & ubi-media computing, pp 205–210

  • Vidhu Bhala RV, Abirami S (2014) Trends in word sense disambiguation. Artif Intell Rev 42(2):159–171

  • Virpioja S, Väyrynen J, Mansikkaniemi A, Kurimo M (2010) Applying morphological decomposition to statistical machine translation. In: Proceedings of the joint fifth workshop on statistical machine translation and metrics MATR. Uppsala University, Uppsala, Sweden, pp 195–200

  • Wang L, Wong DF, Chao LS, Lu Y, Xing J (2014) A systematic comparison of data selection criteria for SMT domain adaptation. Sci World J 2014

  • Wang R, Osenova P, Simov K (2012) Linguistically-augmented Bulgarian-to-English statistical machine translation model. In: Proceedings of the joint workshop on exploiting synergies between information retrieval and machine translation (ESIRMT) and hybrid approaches to machine translation (HyTra). Association for Computational Linguistics, Avignon, France, pp 119–128

  • Wang R, Zhao H, Lu BL (2015) Bilingual continuous-space language model growing for statistical machine translation. IEEE/ACM Trans Audio Speech Lang Process 23(7):1209–1220

  • Wang R, Utiyama M, Goto I, Sumita E, Zhao H, Lu BL (2016) Converting continuous-space language models into N-gram language models with efficient bilingual pruning for statistical machine translation. ACM Trans Asian Low-Resour Lang Inf Process 15(3):11:1–11:26

  • Weller M, Kisselew M, Smekalova S, Fraser A, Schmid H, Durrani N, Sajjad H, Farkas R (2013) Munich-Edinburgh-Stuttgart submissions at WMT13: morphological and syntactic processing for SMT. In: Proceedings of the eighth workshop on statistical machine translation. Association for Computational Linguistics, Sofia, Bulgaria, pp 232–239

  • Williams P, Sennrich R, Post M, Koehn P (2016) Syntax-based statistical machine translation. Morgan & Claypool, San Rafael

  • Wołk K, Marasek K (2013) Polish - English speech statistical machine translation systems for the IWSLT 2013. In: Proceedings of the international workshop on spoken language translation (IWSLT), Heidelberg, Germany

  • Wołk K, Marasek K (2014a) Enhanced bilingual evaluation understudy. In: Proceedings of the 11th international workshop on spoken language translation (IWSLT), Lake Tahoe, pp 191–197

  • Wołk K, Marasek K (2014b) Polish - English speech statistical machine translation systems for the IWSLT 2014. In: Proceedings of the international workshop on spoken language translation (IWSLT), Lake Tahoe, pp 143–149

  • Wołk K, Marasek K (2015a) Neural-based machine translation for medical text domain. Based on European Medicines Agency leaflet texts. Procedia Comput Sci 64:2–9

  • Wołk K, Marasek K (2015b) PJAIT systems for the IWSLT 2015 evaluation campaign enhanced by comparable corpora. In: Proceedings of the international workshop on spoken language translation (IWSLT), Da Nang, Vietnam, pp 101–104

  • Wołk K, Marasek K, Glinkowski W (2015a) Telemedicine as a special case of the machine translation. Comput Med Imaging Graph 46:249–256

  • Wołk K, Rejmund E, Marasek K (2015b) Harvesting comparable corpora and mining them for equivalent bilingual sentences using statistical classification and analogy-based heuristics. In: Proceedings of the international symposium on methodologies for intelligent systems (ISMIS), pp 433–441

  • Wróblewska A (2011) Polish-English word alignment: preliminary study. Emerg Intell Technol Ind 369:123–132

  • Wu X, Yu H, Liu Q (2014) RED: DCU-CASICT participation in WMT2014 metrics task. In: Proceedings of the ninth workshop on statistical machine translation. Association for Computational Linguistics, Baltimore, Maryland, USA, pp 420–425

  • Xiong D, Zhang M (2015) Backward and trigger-based language models for statistical machine translation. Nat Lang Eng 21(2):201–226

  • Žabokrtský Z, Ptáček J, Pajas P (2008) TectoMT: Highly modular MT system with tectogrammatics used as transfer layer. In: Proceedings of the third workshop on statistical machine translation. Association for Computational Linguistics, Columbus, Ohio, USA, pp 167–170

  • Zeman D, Fishel M, Berka J, Bojar O (2011) Addicter: What is wrong with my translations? Prague Bull Math Linguist 96:79–88

  • Zens R, Ney H (2006) Discriminative reordering models for statistical machine translation. In: Proceedings of the workshop on statistical machine translation, New York City, pp 55–63

Acknowledgements

The authors would like to thank the editor and anonymous reviewers for their helpful and constructive comments that greatly contributed to improving the paper. Funding was provided by Javna Agencija za Raziskovalno Dejavnost RS (Grant Nos. P2-0069, P2-0041).

Author information

Corresponding author

Correspondence to Mirjam Sepesy Maučec.

About this article

Cite this article

Maučec, M.S., Brest, J. Slavic languages in phrase-based statistical machine translation: a survey. Artif Intell Rev 51, 77–117 (2019). https://doi.org/10.1007/s10462-017-9558-2

  • DOI: https://doi.org/10.1007/s10462-017-9558-2
