DOI: 10.1145/3411170.3411258
research-article

Building an Italian-Chinese Parallel Corpus for Machine Translation from the Web

Published: 14 September 2020

ABSTRACT

In an increasingly globalized world, the ability to understand texts in different languages (especially those written in different alphabets and character sets) has become a necessity. This ability can be strategic even when travelling across countries where different languages are spoken. Bilingual corpora are therefore critical resources, since they are the basis of every state-of-the-art automatic translation system; however, building a parallel corpus is usually a complex and very expensive operation. This paper describes an innovative approach we have defined and adopted to automatically build an Italian-Chinese parallel corpus, with the aim of using it to train an Italian-Chinese Neural Machine Translation system. Our main idea is to scrape parallel texts from the Web: we defined a general pipeline, describing each step from the selection of appropriate data sources to the sentence alignment method. A final evaluation was conducted to assess the quality of our approach, and its results show that 90% of the sentences were correctly aligned. The resulting corpus consists of more than 6,000 Italian-Chinese sentence pairs, which form the basis for building a Machine Translation system.
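The abstract does not say which sentence alignment method the pipeline uses. As a rough illustration of what the alignment step has to do, the following minimal sketch aligns two sentence lists with a simple length-based dynamic program. It is not the authors' method: the function names, the `LENGTH_RATIO` and `SKIP_PENALTY` constants, and the cost function are all hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical constants, chosen only for illustration:
# assumed ratio of Chinese characters to Italian characters in a translation,
# and a fixed penalty for leaving a sentence without a partner.
LENGTH_RATIO = 0.35
SKIP_PENALTY = 1.0


def pair_cost(it_sentence: str, zh_sentence: str) -> float:
    """Relative deviation of the Chinese length from the assumed expected length."""
    expected = len(it_sentence) * LENGTH_RATIO
    return abs(len(zh_sentence) - expected) / max(expected, 1.0)


def align(italian: List[str], chinese: List[str]) -> List[Tuple[str, str]]:
    """Return monotonic 1-to-1 sentence pairs; unmatched sentences are skipped."""
    n, m = len(italian), len(chinese)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int]]]] = [
        [None] * (m + 1) for _ in range(n + 1)
    ]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            # Candidate moves: pair both sentences, skip Italian, skip Chinese.
            candidates = []
            if i > 0 and j > 0:
                step = pair_cost(italian[i - 1], chinese[j - 1])
                candidates.append((cost[i - 1][j - 1] + step, (i - 1, j - 1)))
            if i > 0:
                candidates.append((cost[i - 1][j] + SKIP_PENALTY, (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1] + SKIP_PENALTY, (i, j - 1)))
            for total, previous in candidates:
                if total < cost[i][j]:
                    cost[i][j] = total
                    back[i][j] = previous

    # Walk back from the end, collecting only the diagonal (paired) moves.
    pairs: List[Tuple[str, str]] = []
    i, j = n, m
    while (i, j) != (0, 0):
        prev_i, prev_j = back[i][j]
        if prev_i == i - 1 and prev_j == j - 1:
            pairs.append((italian[i - 1], chinese[j - 1]))
        i, j = prev_i, prev_j
    pairs.reverse()
    return pairs
```

Calling `align(italian_sentences, chinese_sentences)` returns the list of paired sentences in document order. A real pipeline would likely need many-to-one alignments and a better similarity signal than character length alone, for instance a translation-based score.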


      • Published in

        ACM Other conferences
        GoodTechs '20: Proceedings of the 6th EAI International Conference on Smart Objects and Technologies for Social Good
        September 2020
        286 pages
        ISBN:9781450375597
        DOI:10.1145/3411170

        Copyright © 2020 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article
        • Research
        • Refereed limited