Abstract
The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely available resource, but they are limited and do not provide enough coverage for good-quality translation because of out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which unfortunately depend on the quantity and quality of their training data. Such systems have very limited availability, especially for some languages and very narrow text domains. In this research we present our improvements to current quasi-comparable corpora mining methodologies: we re-implement the comparison algorithms, introduce a tuning script and improve performance using GPU acceleration. The experiments are conducted on the lecture text domain, and bilingual data are extracted from a web crawl. The modifications had a positive impact on the quality and quantity of the mined data, as well as on translation quality, which we measured with the BLEU, NIST and TER metrics. By defining proper translation parameters for morphologically rich languages, we improve translation quality and draw conclusions.
Acknowledgments
This research was supported by the Polish-Japanese Academy of Information Technology statutory resources (ST/MUL/2016), resources for young researchers at PJATK, and the CLARIN ERIC research program.
Copyright information
© 2017 Springer International Publishing Switzerland
About this paper
Cite this paper
Wołk, K., Marasek, K. (2017). Unsupervised Construction of Quasi-comparable Corpora and Probing for Parallel Textual Data. In: Zgrzywa, A., Choroś, K., Siemiński, A. (eds) Multimedia and Network Information Systems. Advances in Intelligent Systems and Computing, vol 506. Springer, Cham. https://doi.org/10.1007/978-3-319-43982-2_27
DOI: https://doi.org/10.1007/978-3-319-43982-2_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43981-5
Online ISBN: 978-3-319-43982-2
eBook Packages: Engineering, Engineering (R0)