Exploratory Data Analysis for the Automatic Detection of Question Paraphrasing in Collaborative Environments

  • Conference paper
Advances in Computational Intelligence (MICAI 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13613)

Abstract

Internet searches are a daily occurrence, but different people often search for the same topic using different words; this is called paraphrasing. Paraphrasing involves syntactic changes and word overlap, governed by the rules of the language in question. Paraphrase identification is an important problem in natural language processing (NLP), especially for questions that share the same intent. In addition, studies of similarity often leave some features out of account, which lowers identification performance. In this paper, we address automatic paraphrase identification on the Quora Question Pairs (QQP) dataset, paying special attention to the shape of the data through exploratory data analysis (EDA). The aim is to obtain better results in identification tasks and to compare different classifiers in collaborative environments where resources are limited.
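As a rough illustration of the kind of pair-level statistic an exploratory analysis of question pairs might inspect, the sketch below computes word overlap between two questions. The abstract does not say which features the paper uses, so the function names, the tokenization rule, and the three features here are assumptions chosen for illustration only, and the example questions are invented rather than taken from QQP.

```python
import re


def tokens(question: str) -> set[str]:
    """Lowercase a question and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", question.lower()))


def pair_features(q1: str, q2: str) -> dict[str, float]:
    """Compute simple overlap features for one question pair (hypothetical feature set)."""
    t1, t2 = tokens(q1), tokens(q2)
    shared = t1 & t2
    union = t1 | t2
    return {
        # Number of distinct words the two questions have in common.
        "shared_words": len(shared),
        # Jaccard similarity: shared words divided by all distinct words.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Absolute difference in character length.
        "length_diff": abs(len(q1) - len(q2)),
    }


# Invented example pairs, not drawn from the dataset.
pairs = [
    ("How do I learn Python quickly?", "What is the fastest way to learn Python?"),
    ("How do I learn Python quickly?", "Why is the sky blue?"),
]
for q1, q2 in pairs:
    print(pair_features(q1, q2))
```

Plotting the distribution of such features separately for duplicate and non-duplicate pairs is one common way to judge whether a feature is likely to help a lightweight classifier.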

Author information

Corresponding author

Correspondence to Tania Alcantara.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Alcantara, T., Calvo, H. (2022). Exploratory Data Analysis for the Automatic Detection of Question Paraphrasing in Collaborative Environments. In: Pichardo Lagunas, O., Martínez-Miranda, J., Martínez Seis, B. (eds) Advances in Computational Intelligence. MICAI 2022. Lecture Notes in Computer Science, vol 13613. Springer, Cham. https://doi.org/10.1007/978-3-031-19496-2_15

  • DOI: https://doi.org/10.1007/978-3-031-19496-2_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19495-5

  • Online ISBN: 978-3-031-19496-2

  • eBook Packages: Computer Science, Computer Science (R0)
