Abstract
Internet searches are a daily occurrence, and different people often search for the same topic using different words; expressing the same meaning in different words is called paraphrasing. Paraphrasing involves syntactic changes and word overlap, both governed by the rules of the language in question. Paraphrase identification is an important problem in natural language processing (NLP), especially for questions that are worded differently but share the same intent. In addition, studies of similarity often leave some features out of account, which lowers identification performance. In this paper, we address automatic paraphrase identification on the Quora Question Pairs (QQP) dataset, paying special attention to the shape of the data through exploratory data analysis (EDA). The aim is to obtain better results in the identification task and to compare different classifiers in collaborative environments where resources are limited.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Alcantara, T., Calvo, H. (2022). Exploratory Data Analysis for the Automatic Detection of Question Paraphrasing in Collaborative Environments. In: Pichardo Lagunas, O., Martínez-Miranda, J., Martínez Seis, B. (eds) Advances in Computational Intelligence. MICAI 2022. Lecture Notes in Computer Science, vol 13613. Springer, Cham. https://doi.org/10.1007/978-3-031-19496-2_15
DOI: https://doi.org/10.1007/978-3-031-19496-2_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19495-5
Online ISBN: 978-3-031-19496-2
eBook Packages: Computer Science, Computer Science (R0)