OSPT: European Portuguese Paraphrastic Dataset with Machine Translation

Sousa, Afonso; Cardoso, Henrique Lopes

doi:10.1007/978-3-031-49008-8_36

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 14115))

Included in the following conference series:

EPIA Conference on Artificial Intelligence

337 Accesses

Abstract

We describe OSPT, a new linguistic resource for European Portuguese that comprises more than 1.5 million Portuguese-Portuguese sentential paraphrase pairs. We generated the pairs automatically by using neural machine translation to translate the non-Portuguese side of a large parallel corpus. We hope this new corpus can be a valuable resource for paraphrase generation and provide a rich semantic knowledge source to improve downstream natural language understanding tasks. To show the quality and utility of such a dataset, we use it to train paraphrastic sentence embeddings and evaluate them in the ASSIN2 semantic textual similarity (STS) competition. We found that semantic embeddings trained on a small subset of OSPT can produce better semantic embeddings than the ones trained in the finely curated ASSIN2’s training data. Additionally, we show OSPT can be used for paraphrase generation with the potential to produce good data augmentation systems that pseudo-translate from Brazilian Portuguese to European Portuguese.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 59.99; Price excludes VAT (USA)

Softcover Book: USD 79.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
The code and data are available at https://github.com/afonso-sousa/pt_para_gen.
2.
This model can be found on the HuggingFace as “ricardo-filho/bert-base-portuguese-cased-nli-assin-2”.
3.
https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs.
4.
We will release code and embeddings under the permissive MIT license.

References

Arora, S., Liang, Y., Ma, T.: A simple but tough-to-beat baseline for sentence embeddings. In: International Conference on Learning Representations (2017)
Google Scholar
Bandel, E., Aharonov, R., Shmueli-Scheuer, M., Shnayderman, I., Slonim, N., Ein-Dor, L.: Quality controlled paraphrase generation. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 596–609. Association for Computational Linguistics, Dublin, Ireland, May 2022. https://doi.org/10.18653/v1/2022.acl-long.45
Bannard, C., Callison-Burch, C.: Paraphrasing with bilingual parallel corpora. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pp. 597–604. Association for Computational Linguistics, Ann Arbor, Michigan, Jun 2005. https://doi.org/10.3115/1219840.1219914
Barreiro, A., Mota, C.: e-pact: esperto paraphrase aligned corpus of en-ep/bp translations. Traduçao em Revista 1(22), 87–102 (2017)
Google Scholar
Barreiro, A., Mota, C., Baptista, J., Chacoto, L., Carvalho, P.: Linguistic resources for paraphrase generation in portuguese: a lexicon-grammar approach. Lang. Resour. Eval. (2021)
Google Scholar
Barzilay, R., McKeown, K.R.: Extracting paraphrases from a parallel corpus. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pp. 50–57. Association for Computational Linguistics, Toulouse, France, Jul 2001. https://doi.org/10.3115/1073012.1073020
Bhagat, R., Hovy, E.: Squibs: what is a paraphrase? Comput. Linguist. 39(3), 463–472 (2013)
Google Scholar
Bojar, O., Dušek, O., Kocmi, T., Libovickỳ, J., Novák, M., Popel, M., Sudarikov, R., Variš, D.: Czeng 1.6: enlarged czech-english parallel corpus with processing tools dockered. In: Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, 12–16 Sept. 2016, Proceedings 19. pp. 231–238. Springer (2016)
Google Scholar
Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised learning of universal sentence representations from natural language inference data. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680. Association for Computational Linguistics, Copenhagen, Denmark, Sept. 2017. https://doi.org/10.18653/v1/D17-1070
Dolan, W.B., Brockett, C.: Automatically constructing a corpus of sentential paraphrases. In: Proceedings of the Third International Workshop on Paraphrasing (IWP2005) (2005)
Google Scholar
Gan, Z., Pu, Y., Henao, R., Li, C., He, X., Carin, L.: Learning generic sentence representations using convolutional neural networks. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2390–2400. Association for Computational Linguistics, Copenhagen, Denmark, Sept. 2017. https://doi.org/10.18653/v1/D17-1254
Ganitkevitch, J., Callison-Burch, C.: The multilingual paraphrase database. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp. 4276–4283. European Language Resources Association (ELRA), Reykjavik, Iceland, May 2014
Google Scholar
Ganitkevitch, J., Van Durme, B., Callison-Burch, C.: PPDB: the paraphrase database. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 758–764. Association for Computational Linguistics, Atlanta, Georgia, Jun 2013
Google Scholar
Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, Nov 2021. https://doi.org/10.18653/v1/2021.emnlp-main.552
Henderson, M., Al-Rfou, R., Strope, B., Sung, Y.H., Lukács, L., Guo, R., Kumar, S., Miklos, B., Kurzweil, R.: Efficient natural language response suggestion for smart reply (2017). arXiv:1705.00652
Hill, F., Cho, K., Korhonen, A.: Learning distributed representations of sentences from unlabelled data. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1367–1377. Association for Computational Linguistics, San Diego, CA, June 2016. https://doi.org/10.18653/v1/N16-1162
Hosking, T., Tang, H., Lapata, M.: Hierarchical sketch induction for paraphrase generation. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2489–2501. Association for Computational Linguistics, Dublin, Ireland, May 2022. https://doi.org/10.18653/v1/2022.acl-long.178
Jiang, Y., Kummerfeld, J.K., Lasecki, W.S.: Understanding task design trade-offs in crowdsourced paraphrase collection. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 103–109. Association for Computational Linguistics, Vancouver, Canada, Jul 2017. https://doi.org/10.18653/v1/P17-2017
Kim, T., Choi, J., Edmiston, D., Goo Lee, S.: Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction. In: International Conference on Learning Representations (2020)
Google Scholar
Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of Machine Translation Summit X: Papers, Phuket, Thailand, pp. 79–86, 13–15 Sept. 2005
Google Scholar
Lan, W., Qiu, S., He, H., Xu, W.: A continuously growing dataset of sentential paraphrases. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 1224–1234. Association for Computational Linguistics, Copenhagen, Denmark, Sept. 2017. https://doi.org/10.18653/v1/D17-1126
Li, Z., Jiang, X., Shang, L., Li, H.: Paraphrase generation with deep reinforcement learning. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3865–3878. Association for Computational Linguistics, Brussels, Belgium, Oct.–Nov. 2018. https://doi.org/10.18653/v1/D18-1421
Lison, P., Tiedemann, J.: OpenSubtitles2016: extracting large parallel corpora from movie and TV subtitles. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 923–929. European Language Resources Association (ELRA), Portorož, Slovenia, May 2016
Google Scholar
Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., Zettlemoyer, L.: Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguist. 8, 726–742 (2020)
Google Scholar
Mallinson, J., Sennrich, R., Lapata, M.: Paraphrasing revisited with neural machine translation. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 881–893. Association for Computational Linguistics, Valencia, Spain, Apr 2017
Google Scholar
Mrkšić, N., Ó Séaghdha, D., Thomson, B., Gašić, M., Rojas-Barahona, L.M., Su, P.H., Vandyke, D., Wen, T.H., Young, S.: Counter-fitting word vectors to linguistic constraints. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 142–148. Association for Computational Linguistics, San Diego, CA, June 2016. https://doi.org/10.18653/v1/N16-1018
Pagliardini, M., Gupta, P., Jaggi, M.: Unsupervised learning of sentence embeddings using compositional n-gram features. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 528–540. Association for Computational Linguistics, New Orleans, Louisiana, June 2018. https://doi.org/10.18653/v1/N18-1049
Pellicer, L.F.A.O., Pirozelli, P., Costa, A.H.R., Inoue, A.: PTT5-paraphraser: diversity and meaning fidelity in automatic portuguese paraphrasing. In: Computational Processing of the Portuguese Language: 15th International Conference, PROPOR 2022, Fortaleza, Brazil, 21–23 Mar. 2022, Proceedings, pp. 299–309. Springer (2022)
Google Scholar
Real, L., Fonseca, E., Oliveira, H.G.: The ASSIN 2 shared task: a quick overview. In: International Conference on Computational Processing of the Portuguese Language, pp. 406–412. Springer (2020)
Google Scholar
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Nov. 2019
Google Scholar
Reimers, N., Gurevych, I.: Making monolingual sentence embeddings multilingual using knowledge distillation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Nov. 2020
Google Scholar
Scherrer, Y.: TaPaCo: a corpus of sentential paraphrases for 73 languages. In: Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 6868–6873. European Language Resources Association, Marseille, France, May 2020
Google Scholar
Sun, H., Zhou, M.: Joint learning of a dual SMT system for paraphrase generation. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 38–42. Association for Computational Linguistics, Jeju Island, Korea, July 2012
Google Scholar
Suzuki, Y., Kajiwara, T., Komachi, M.: Building a non-trivial paraphrase corpus using multiple machine translation systems. In: Proceedings of ACL 2017, Student Research Workshop, pp. 36–42. Association for Computational Linguistics, Vancouver, Canada, Jul 2017
Google Scholar
Tang, Y., Tran, C., Li, X., Chen, P.J., Goyal, N., Chaudhary, V., Gu, J., Fan, A.: Multilingual translation with extensible multilingual pretraining and finetuning (2020). arXiv:2008.00401
Wieting, J., Bansal, M., Gimpel, K., Livescu, K.: From paraphrase database to compositional paraphrase model and back. Trans. Assoc. Comput. Linguist. 3, 345–358 (2015)
Google Scholar
Wieting, J., Gimpel, K.: ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 451–462. Association for Computational Linguistics, Melbourne, Australia, July 2018. https://doi.org/10.18653/v1/P18-1042
Wieting, J., Mallinson, J., Gimpel, K.: Learning paraphrastic sentence embeddings from back-translated bitext. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 274–285. Association for Computational Linguistics, Copenhagen, Denmark, Sept. 2017. https://doi.org/10.18653/v1/D17-1026
Zhang, Y., Baldridge, J., He, L.: PAWS: Paraphrase adversaries from word scrambling. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1298–1308. Association for Computational Linguistics, Minneapolis, Minnesota, June 2019. https://doi.org/10.18653/v1/N19-1131

Download references

Acknowledgements

The first author is supported by a PhD studentship with reference 2022.13409.BD from Fundação para a Ciência e a Tecnologia (FCT).

Author information

Authors and Affiliations

Faculdade de Engenharia, Universidade do Porto, Porto, Portugal
Afonso Sousa & Henrique Lopes Cardoso
Laboratório de Inteligência Artificial e Ciência de Computadores (LIACC), Porto, Portugal
Afonso Sousa & Henrique Lopes Cardoso

Authors

Afonso Sousa
View author publications
You can also search for this author in PubMed Google Scholar
Henrique Lopes Cardoso
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Afonso Sousa .

Editor information

Editors and Affiliations

Lucy Family Institute for Data & Society, Notre Dame, IN, USA
Nuno Moniz
GECAD, Polytechnic of Porto, Porto, Portugal
Zita Vale
GRIA - LIACC, University of Azores, Ponta-Delgada, Portugal
José Cascalho
CISUC, University of Coimbra, Coimbra, Portugal
Catarina Silva
IEETA, University of Aveiro, Aveiro, Portugal
Raquel Sebastião

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Sousa, A., Cardoso, H.L. (2023). OSPT: European Portuguese Paraphrastic Dataset with Machine Translation. In: Moniz, N., Vale, Z., Cascalho, J., Silva, C., Sebastião, R. (eds) Progress in Artificial Intelligence. EPIA 2023. Lecture Notes in Computer Science(), vol 14115. Springer, Cham. https://doi.org/10.1007/978-3-031-49008-8_36

Download citation

DOI: https://doi.org/10.1007/978-3-031-49008-8_36
Published: 15 December 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-49007-1
Online ISBN: 978-3-031-49008-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics