Exploratory Data Analysis for the Automatic Detection of Question Paraphrasing in Collaborative Environments

Alcantara, Tania; Calvo, Hiram

doi:10.1007/978-3-031-19496-2_15

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13613)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

Internet searches are a daily occurrence, and different people often search for the same topic using different words; this is called paraphrasing. Paraphrasing involves syntactic changes and word overlap, governed by the rules of the language in question. Identifying paraphrases is an important problem in natural language processing (NLP), especially for questions that share the same intent. Moreover, studies of similarity often leave some features out of account, which lowers identification performance. In this paper, we address automatic paraphrase identification on the Quora Question Pairs (QQP) dataset, paying special attention to the shape of the data through exploratory data analysis (EDA). The aim is to obtain better results in identification tasks and to compare different classifiers in collaborative environments where resources are limited.
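As a rough illustration of the word overlap mentioned in the abstract, the sketch below computes a Jaccard overlap between the word sets of two questions, the kind of simple feature an exploratory analysis of question pairs might inspect. The function names, the tokenization rule, and the example questions are assumptions made for illustration; they are not taken from the paper or from the QQP dataset.

```python
import re


def tokenize(question: str) -> set[str]:
    """Lowercase a question and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", question.lower()))


def word_overlap(q1: str, q2: str) -> float:
    """Jaccard overlap of the two questions' word sets (0.0 when both are empty)."""
    a, b = tokenize(q1), tokenize(q2)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical question pairs, invented for this example.
pairs = [
    ("How do I learn Python quickly?", "What is the fastest way to learn Python?"),
    ("How do I learn Python quickly?", "Why is the sky blue?"),
]

for q1, q2 in pairs:
    print(f"{word_overlap(q1, q2):.2f}  {q1!r} vs {q2!r}")
```

A score like this captures only surface overlap: two questions with the same intent can share few words, which is one reason overlap alone would not be expected to settle paraphrase identification.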

Author information

Authors and Affiliations

Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico City, Mexico
Tania Alcantara & Hiram Calvo

Corresponding author

Correspondence to Tania Alcantara.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Mexico, Mexico
Obdulia Pichardo Lagunas
Centro de Investigación Científica y de Educación Superior de Ensenada, Ensenada, Baja California, Mexico
Juan Martínez-Miranda
Instituto Politécnico Nacional, Mexico, Mexico
Bella Martínez Seis

About this paper

Cite this paper

Alcantara, T., Calvo, H. (2022). Exploratory Data Analysis for the Automatic Detection of Question Paraphrasing in Collaborative Environments. In: Pichardo Lagunas, O., Martínez-Miranda, J., Martínez Seis, B. (eds) Advances in Computational Intelligence. MICAI 2022. Lecture Notes in Computer Science, vol 13613. Springer, Cham. https://doi.org/10.1007/978-3-031-19496-2_15

DOI: https://doi.org/10.1007/978-3-031-19496-2_15
Published: 23 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19495-5
Online ISBN: 978-3-031-19496-2
eBook Packages: Computer Science, Computer Science (R0)
