ABSTRACT
In an increasingly globalized world, the ability to understand texts in different languages, and often in different alphabets and character sets, has become a necessity, particularly for people travelling across countries where different languages are spoken. Bilingual corpora are therefore critical resources, since they are the basis of every state-of-the-art automatic translation system; however, building a parallel corpus is usually a complex and very expensive operation. This paper describes an innovative approach we defined and adopted to automatically build an Italian-Chinese parallel corpus, with the aim of using it to train an Italian-Chinese Neural Machine Translation system. Our main idea is to scrape parallel texts from the Web: we defined a general pipeline and describe each of its steps, from the selection of appropriate data sources to the sentence alignment method. A final evaluation of the approach shows that 90% of the sentences were correctly aligned. The resulting corpus consists of more than 6,000 Italian-Chinese sentence pairs, which form the basis for building a Machine Translation system.
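The abstract does not say which sentence alignment method the pipeline uses, so the following is only a minimal sketch of one classic option: length-based alignment solved with dynamic programming. The function names, the bead penalties, and the default `ratio` (expected Chinese characters per Italian character) are all hypothetical values chosen for illustration, not figures from the paper.

```python
import math
from typing import List, Tuple

# A bead pairs indices of Italian sentences with indices of Chinese sentences.
Bead = Tuple[Tuple[int, ...], Tuple[int, ...]]

# (Italian sentences consumed, Chinese sentences consumed, fixed penalty).
# The penalty values are illustrative assumptions, not taken from the paper.
BEAD_TYPES = [(1, 1, 0.0), (2, 1, 0.5), (1, 2, 0.5), (1, 0, 2.0), (0, 1, 2.0)]


def length_cost(it_len: int, zh_len: int, ratio: float) -> float:
    """Cost grows as the Chinese length deviates from ratio * Italian length."""
    expected = it_len * ratio
    return abs(expected - zh_len) / math.sqrt(expected + zh_len + 1.0)


def align(it_sents: List[str], zh_sents: List[str], ratio: float = 0.35) -> List[Bead]:
    """Return the minimum-cost sequence of beads covering both sentence lists.

    `ratio` is a hypothetical length ratio; in practice it would be
    estimated from data known to be parallel.
    """
    n, m = len(it_sents), len(zh_sents)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    # Forward pass: relax every bead type from each reachable cell.
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj, penalty in BEAD_TYPES:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                it_len = sum(len(s) for s in it_sents[i:ni])
                zh_len = sum(len(s) for s in zh_sents[j:nj])
                cost = best[i][j] + penalty + length_cost(it_len, zh_len, ratio)
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)

    # Backward pass: follow the stored predecessors from the end to the start.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((tuple(range(pi, i)), tuple(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align(italian_sentences, chinese_sentences)` returns a list of index tuples; beads with an empty side mark sentences left unaligned. A purely length-based score is a deliberately simple choice here, and a real pipeline would likely combine it with lexical or translation-based evidence.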
Index Terms
- Building an Italian-Chinese Parallel Corpus for Machine Translation from the Web