Building an Italian-Chinese Parallel Corpus for Machine Translation from the Web

Authors:

Rita Tse
School of Applied Sciences - Macao Polytechnic Institute, Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence, Ministry of Education - Macao (China)

Silvia Mirri
Department of Computer Science and Engineering - University of Bologna - Bologna (Italy)

Su-Kit Tang
School of Applied Sciences - Macao Polytechnic Institute, Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence, Ministry of Education - Macao (China)

Giovanni Pau
Department of Computer Science and Engineering - University of Bologna - Bologna (Italy), Computer Science Department - UCLA - Los Angeles, CA (USA)

Paola Salomoni
Department of Computer Science and Engineering - University of Bologna - Bologna (Italy)

GoodTechs '20: Proceedings of the 6th EAI International Conference on Smart Objects and Technologies for Social Good, September 2020, Pages 265–268
https://doi.org/10.1145/3411170.3411258

Published: 14 September 2020

ABSTRACT

In an increasingly globalized world, the ability to understand texts in different languages, especially those written in different scripts and character sets, has become a necessity. It is particularly valuable when travelling across countries where different languages are spoken. Bilingual corpora are therefore critical resources, since they are the basis of every state-of-the-art automatic translation system; building a parallel corpus, however, is usually a complex and very expensive operation. This paper describes an innovative approach we have defined and adopted to automatically build an Italian-Chinese parallel corpus, with the aim of using it to train an Italian-Chinese Neural Machine Translation system. Our main idea is to scrape parallel texts from the Web: we defined a general pipeline and describe each of its steps, from the selection of appropriate data sources to the sentence alignment method. A final evaluation of the approach shows that 90% of the sentences were correctly aligned. The resulting corpus consists of more than 6,000 Italian-Chinese sentence pairs, which form the basis for building a Machine Translation system.
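
The abstract does not specify which sentence alignment method the pipeline uses. As a purely illustrative sketch of what such a step can look like, the following Python code aligns two sentence lists with a simplified length-based dynamic program, loosely in the spirit of classic length-based aligners. It is not the authors' method. The names `CHARS_IT_PER_ZH`, `SKIP_COST`, `match_cost`, and `align`, as well as both numeric constants, are assumptions introduced here for illustration.

```python
from typing import List, Optional, Tuple

# Assumed average number of Italian characters per Chinese character.
# Hypothetical value for illustration only; not a figure from the paper.
CHARS_IT_PER_ZH = 3.0
# Assumed penalty for leaving a sentence without a counterpart.
SKIP_COST = 0.6

Pair = Tuple[Optional[int], Optional[int]]


def match_cost(it: str, zh: str) -> float:
    """Return 0 when the scaled lengths agree, approaching 1 as they diverge."""
    a = float(len(it))
    b = len(zh) * CHARS_IT_PER_ZH
    return abs(a - b) / max(a + b, 1.0)


def align(it_sents: List[str], zh_sents: List[str]) -> List[Pair]:
    """Monotonic alignment allowing 1-1 pairs and skipped sentences."""
    n, m = len(it_sents), len(zh_sents)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:  # pair Italian sentence i-1 with Chinese sentence j-1
                c = cost[i - 1][j - 1] + match_cost(it_sents[i - 1], zh_sents[j - 1])
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (1, 1)
            if i:  # leave Italian sentence i-1 unaligned
                c = cost[i - 1][j] + SKIP_COST
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (1, 0)
            if j:  # leave Chinese sentence j-1 unaligned
                c = cost[i][j - 1] + SKIP_COST
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (0, 1)
    # Walk the back-pointers from the end to recover the alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i or j:
        di, dj = back[i][j]
        pairs.append((i - 1 if di else None, j - 1 if dj else None))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs


italian = ["Ciao.", "Questa è una frase molto più lunga della precedente.", "Grazie."]
chinese = ["你好。", "这是一个比前一个长得多的句子。", "谢谢。"]
for it_idx, zh_idx in align(italian, chinese):
    print(it_idx, zh_idx)
```

Each returned pair holds an index into the Italian list and an index into the Chinese list, with `None` marking a sentence left unaligned. A realistic aligner would also need merged beads (for example two sentences translated as one) and a ratio estimated from data rather than the fixed constant assumed here.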

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Machine translation
  2. Machine learning

Published in
GoodTechs '20: Proceedings of the 6th EAI International Conference on Smart Objects and Technologies for Social Good
September 2020, 286 pages
ISBN: 9781450375597
DOI: 10.1145/3411170
General Chairs: Catia Prandi (University of Bologna, Italy), Johann Marquez-Barja (University of Antwerp - imec, Belgium)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Italian-Chinese corpus
bilingual annotation
machine translation
sentence alignment
Qualifiers
- research-article
- Research
- Refereed limited