Abstract
The design and implementation of automatic evaluation methods is integral to scientific research, as it accelerates the development cycle. This is no less true for machine translation (MT) systems. However, no global and systematic scheme exists for evaluating the performance of an MT system. Existing evaluation metrics such as BLEU, METEOR, and TER, although used extensively in the literature, have faced considerable criticism from users. Moreover, the performance of these metrics often varies with the pair of languages under consideration. This observation is equally pertinent for translations involving languages of the Indian subcontinent. This study aims at developing an evaluation metric for English-to-Hindi MT outputs. As part of this process, a set of probable errors has been identified both manually and automatically. Linear regression has been used to compute a weight/penalty for each error, taking human evaluations into consideration, and a sentence score is computed as the weighted sum of the errors. A set of 126 models has been built using different single classifiers and ensembles of classifiers in order to find the most suitable model for allocating an appropriate weight/penalty to each error. The outputs of the models have been compared with state-of-the-art evaluation metrics. The models developed for manually identified errors correlate well with manual evaluation scores, whereas the models for automatically identified errors have low correlation with the manual scores. This indicates the need for further improvement and for more sophisticated linguistic tools for automatic identification and extraction of errors. Although automatic machine translation tools are being developed for many different language pairs, there is no generalized scheme for designing meaningful metrics for their evaluation.
The proposed scheme should help in developing such metrics for different language pairs in the coming days.
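The core idea in the abstract — fit per-error weights/penalties by linear regression against human evaluations, then score a sentence as the weighted sum of its error counts — can be sketched as follows. This is a minimal illustration with made-up error categories and numbers; the paper's actual error inventory, data, and 126 candidate models differ.

```python
import numpy as np

# Hypothetical error categories (the paper's actual inventory differs):
# per-sentence counts of, e.g., word-order, inflection, and lexical errors.
error_counts = np.array([
    [2, 1, 0],
    [0, 3, 1],
    [1, 0, 2],
    [4, 2, 1],
    [0, 1, 0],
], dtype=float)

# Human evaluation scores for the same sentences (illustrative values).
human_scores = np.array([3.5, 2.8, 3.1, 1.6, 4.4])

# Fit an intercept b and weights w so that b + counts @ w ~ human score.
# The learned weights act as per-error penalties.
X = np.hstack([np.ones((error_counts.shape[0], 1)), error_counts])
coeffs, *_ = np.linalg.lstsq(X, human_scores, rcond=None)
intercept, weights = coeffs[0], coeffs[1:]

def sentence_score(counts):
    """Metric score for one sentence: weighted sum of its error counts."""
    return intercept + counts @ weights

print(sentence_score(np.array([1.0, 1.0, 1.0])))
```

The same fitted weights can then be applied to score unseen MT outputs; in the paper, different regression learners and ensembles play the role of this single least-squares fit.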
Notes
The translator as accessed in September 2017. All translations mentioned in the paper were produced in September 2017.
If the subject is a pronoun, the gender will be that of the noun the pronoun is referring to.
http://sivareddy.in/downloads (The tagger has been developed by IIIT Hyderabad).
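The paper's models (and the baseline metrics they are compared against) are judged by how well their sentence scores correlate with human evaluations. A minimal Pearson-correlation check, with purely illustrative numbers rather than the paper's data, can be sketched as:

```python
import numpy as np

# Illustrative metric scores and human judgments (not the paper's data).
metric_scores = np.array([0.62, 0.41, 0.55, 0.23, 0.78])
human_scores = np.array([3.1, 2.4, 3.0, 1.5, 4.2])

# Pearson correlation: the criterion used to compare the proposed models
# with state-of-the-art metrics such as BLEU, METEOR, and TER.
r = np.corrcoef(metric_scores, human_scores)[0, 1]
print(round(r, 3))
```

A value of r close to 1 indicates that the metric ranks outputs much as human judges do; the abstract reports high correlation for the manually identified errors and low correlation for the automatically identified ones.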
References
Balyan, R., & Chatterjee, N. (2015). Translating noun compounds using semantic relations. Computer Speech & Language, 32(1), 91–108.
Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the workshop on intrinsic and extrinsic evaluation measures for MT and/or summarization at the 43rd ACL, Ann Arbor, Michigan.
Comrie, B. (1989). Language universals and linguistic typology. Chicago: The University of Chicago Press.
Bharati, A. & Kulkarni, A. (2005). English from Hindi viewpoint: A Paaninian perspective. In Platinum Jubilee conference of Linguistic Society of India, held at CALTS, University of Hyderabad, Hyderabad.
Breiman, L. (1996a). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (1996b). Stacked regressions. Machine Learning, 24(1), 49–64.
Chatterjee, N., & Balyan, R. (2011). Context resolution of verb particle constructions for English to Hindi translation. In Proceedings of the 25th Pacific Asia conference on language, information and computation (PACLIC 25), Singapore (pp. 140–149).
Chatterjee, N., Johnson, A., & Krishna, M. (2007). Some improvements over the BLEU metric for measuring translation quality for Hindi. In Proceedings of ICCTA 2007, IEEE Computer Society (pp. 485–490).
Dave, S., Parikh, J., & Bhattacharya, P. (2001). Interlingua-based English–Hindi machine translation and language divergence. Machine Translation, 16, 251.
Doddington, G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of HLT 2002, human language technology conference, San Diego, California (pp. 138–145).
Dorr, B. (1993). Machine translation: A view from the Lexicon. Cambridge, MA: The MIT Press.
Dorr, B. (1994). Classification of machine translation divergences and a proposed solution. Computational Linguistics, 20(4), 597–633.
Farrús, M., Costa-jussà, M. R., Mariño, J. B., & Fonollosa, J. A. R. (2010). Linguistic-based evaluation criteria to identify statistical machine translation errors. In Proceedings of EAMT, Saint Raphael, France (pp. 52–57).
Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. In Proceedings of the thirteenth international conference on machine learning, Bari, Italy (pp. 148–156).
Guenther, W. C. (1964). Analysis of variance. Upper Saddle River: Prentice-Hall.
Gupta, D., & Chatterjee, N. (2001). Study of divergence for example-based English–Hindi machine translation. In STRANS 2001, IIT Kanpur (pp. 132–139).
Gupta, D., & Chatterjee, N. (2003). Identification of divergence for English-to-Hindi EBMT. In MT Summit IX, New Orleans, LA, 2003 (pp. 141–148).
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8), 707–710.
Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia (pp. 311–318).
Popović, M. (2011). Hjerson: An open source tool for automatic error classification of machine translation output. The Prague Bulletin of Mathematical Linguistics, 96, 59–67.
Popović, M., & Ney, H. (2007). Word error rates: Decomposition over POS classes and applications for error analysis. In Proceedings of the 2nd ACL 07 workshop on statistical machine translation (WMT 07), Prague, Czech Republic (pp. 48–55).
Schapire, R. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
Sinha, R. M. K., & Thakur, A. (2005a). Translation divergence in English–Hindi MT. In EAMT 10th annual conference, Budapest, Hungary, May 2005 (pp. 245–254).
Sinha, R. M. K., & Thakur, A. (2005b). Divergence patterns in machine translation between Hindi and English. In MT Summit X, Phuket, Thailand, September 2005 (pp. 346–353).
Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of association for machine translation in the Americas—AMTA 2006. Cambridge, MA (pp. 223–231).
Stone, C. J. (1985). Additive regression and other nonparametric models. The Annals of Statistics, 13(2), 689–705.
Vilar, D., Xu, J., D'Haro, L. F., & Ney, H. (2006). Error analysis of statistical machine translation output. In Proceedings of the 5th international conference on language resources and evaluation (LREC 06), Genoa (pp. 697–702).
Wolpert, D. (1992). Stacked generalization. Neural Networks, 5(2), 241–260.
Additional information
Renu Balyan: Work done while at IIT Delhi.
Cite this article
Balyan, R., Chatterjee, N. Factor-based evaluation for English to Hindi MT outputs. Lang Resources & Evaluation 52, 969–996 (2018). https://doi.org/10.1007/s10579-018-9426-y