OSPT: European Portuguese Paraphrastic Dataset with Machine Translation

  • Conference paper
Progress in Artificial Intelligence (EPIA 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14115)

Abstract

We describe OSPT, a new linguistic resource for European Portuguese that comprises more than 1.5 million Portuguese-Portuguese sentential paraphrase pairs. We generated the pairs automatically by using neural machine translation to translate the non-Portuguese side of a large parallel corpus. We hope this new corpus can be a valuable resource for paraphrase generation and provide a rich source of semantic knowledge for improving downstream natural language understanding tasks. To show the quality and utility of the dataset, we use it to train paraphrastic sentence embeddings and evaluate them in the ASSIN2 semantic textual similarity (STS) shared task. We found that sentence embeddings trained on a small subset of OSPT can outperform those trained on ASSIN2’s finely curated training data. Additionally, we show that OSPT can be used for paraphrase generation, with the potential to produce good data augmentation systems that pseudo-translate from Brazilian Portuguese to European Portuguese.
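
The pair-generation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline (their code is in the repository listed in the Notes). The function name `build_paraphrase_pairs`, the English source side, the lookup-table stand-in for a neural MT system, and the filter that drops empty or identical outputs are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def build_paraphrase_pairs(
    bitext: Iterable[Tuple[str, str]],
    translate_to_pt: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each Portuguese reference with a machine translation of its foreign side."""
    pairs = []
    for foreign, reference_pt in bitext:
        candidate_pt = translate_to_pt(foreign).strip()
        reference_pt = reference_pt.strip()
        # Illustrative filter (an assumption, not from the paper): skip empty
        # outputs and exact copies, which carry no paraphrase signal.
        if not candidate_pt or candidate_pt.casefold() == reference_pt.casefold():
            continue
        pairs.append((reference_pt, candidate_pt))
    return pairs


# Hypothetical stand-in for a neural MT system: a lookup table for the demo only.
_TOY_MT = {
    "I do not know what to do.": "Não sei o que hei de fazer.",
    "Thank you.": "Obrigado.",
}


def toy_translate(sentence: str) -> str:
    return _TOY_MT.get(sentence, "")


# Toy parallel corpus: (foreign sentence, human Portuguese translation).
toy_bitext = [
    ("I do not know what to do.", "Não sei o que fazer."),
    ("Thank you.", "Obrigado."),
]

pairs = build_paraphrase_pairs(toy_bitext, toy_translate)
print(pairs)
```

In this toy run the second sentence is discarded because the machine output matches the reference exactly. In practice `toy_translate` would be replaced by a call to a real translation model.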

Notes

  1. The code and data are available at https://github.com/afonso-sousa/pt_para_gen.

  2. This model can be found on HuggingFace as “ricardo-filho/bert-base-portuguese-cased-nli-assin-2”.

  3. https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs.

  4. We will release code and embeddings under the permissive MIT license.

Acknowledgements

The first author is supported by a PhD studentship with reference 2022.13409.BD from Fundação para a Ciência e a Tecnologia (FCT).

Author information

Corresponding author

Correspondence to Afonso Sousa.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sousa, A., Cardoso, H.L. (2023). OSPT: European Portuguese Paraphrastic Dataset with Machine Translation. In: Moniz, N., Vale, Z., Cascalho, J., Silva, C., Sebastião, R. (eds) Progress in Artificial Intelligence. EPIA 2023. Lecture Notes in Computer Science, vol 14115. Springer, Cham. https://doi.org/10.1007/978-3-031-49008-8_36

  • DOI: https://doi.org/10.1007/978-3-031-49008-8_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-49007-1

  • Online ISBN: 978-3-031-49008-8

  • eBook Packages: Computer Science, Computer Science (R0)
