Abstract
Paraphrase Identification (PI) is an important task in Natural Language Processing (NLP) that aims to detect whether two sentences expressed in different forms are semantically consistent. It can be used to detect duplicate questions in QA communities (e.g., Quora and Stack Overflow). Many studies have applied Convolutional Neural Networks to capture rich matching information between sentence pairs layer by layer. However, only a limited number of studies have explored the more flexible Graph Convolutional Networks (GCNs) for this task. A GCN operates directly on a graph and learns each node's representation from the neighborhood information of that node, so the interactive information between two sentences can be effectively integrated based on the local graph structure. In this paper, a Graph-based Interaction Matching Model (GIMM) for PI is proposed. GIMM takes each word as a node and builds the interaction graph from two kinds of relations between nodes: word co-occurrence relations between the sentence pair, and phrase relations within each single sentence. A GCN is then applied to learn richer word representations based on the local structure of the graph. Finally, the node representations are aligned by an attention mechanism to obtain the matching vector, and the PI result is produced by a fully connected layer. We conduct experiments comparing GIMM with current baselines on the Quora and Stack Overflow datasets. Experimental results demonstrate that the proposed model achieves excellent performance on both datasets.
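The graph-convolution step described in the abstract can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: it applies the standard Kipf–Welling propagation rule, H' = ReLU(D^(-1/2)(A+I)D^(-1/2) H W), to a toy interaction graph whose nodes stand for words and whose edges stand for co-occurrence/phrase relations. The function name and the toy adjacency matrix are assumptions for demonstration only.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W).

    A: (n, n) adjacency matrix of the interaction graph
       (edges: word co-occurrence across the sentence pair and
        phrase relations within a sentence, per the abstract).
    H: (n, d_in) node (word) feature matrix.
    W: (d_in, d_out) trainable weight matrix.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(0.0, A_norm @ H @ W)    # ReLU activation

# Toy interaction graph over 4 word nodes (hypothetical edges).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
H = np.random.default_rng(0).normal(size=(4, 8))
W = np.random.default_rng(1).normal(size=(8, 8))
H_next = gcn_layer(A, H, W)
print(H_next.shape)  # (4, 8)
```

Each output row mixes a word's own features with those of its graph neighbors, which is how cross-sentence interaction information flows into the word representations before attention-based alignment.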
Data Availability
The data associated with our study can be made available upon request. Please contact the corresponding author Xuan Zhang (zhxuan@ynu.edu.cn).
Notes
Our pre-trained word embeddings are openly available at https://nlp.stanford.edu/projects
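The pretrained vectors referenced above (GloVe) ship as plain text: one token per line followed by its space-separated vector components. A minimal loader for that format can be sketched as below; the function name is hypothetical and the demo uses a tiny synthetic file rather than the real download.

```python
import os
import tempfile

def load_glove(path, vocab=None):
    """Parse GloVe's plain-text format: token, then its vector, space-separated.

    If `vocab` is given, only tokens in it are kept (saves memory for big files).
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            word = parts[0]
            if vocab is None or word in vocab:
                vectors[word] = [float(x) for x in parts[1:]]
    return vectors

# Demo on a tiny synthetic file in the same format.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("the 0.1 0.2 0.3\nquestion 0.4 0.5 0.6\n")
    tmp_path = f.name
emb = load_glove(tmp_path)
os.unlink(tmp_path)
print(sorted(emb))  # ['question', 'the']
```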
Funding
This work was supported by the Science Foundation of Young and Middle-aged Academic and Technical Leaders of Yunnan under Grant No. 202205AC160040; Science Foundation of Yunnan Jinzhi Expert Workstation under Grant No. 202205AF150006; Major Project of Yunnan Natural Science Foundation under Grant No. 202302AE09002003; Science and Technology Project of Yunnan Power Grid Co., Ltd. under Grant No.YNKJXM20222254; the Postgraduate Research and Innovation Foundation of Yunnan University under Grant No. 2021Z112; Science Foundation of “Knowledge-driven intelligent software engineering innovation team”.
Ethics declarations
Conflicts of interest/Competing interests
We confirm that this work is original and has not been published elsewhere, nor is it currently under consideration for publication elsewhere. None of the authors has any competing interests in the manuscript.
Consent to participate
All the authors are aware of this submission. They have reviewed and consented to participate in this journal submission.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Du, K., Zhang, X., Gao, C. et al. GIMM: A graph convolutional network-based paraphrase identification model to detecting duplicate questions in QA communities. Multimed Tools Appl 83, 31251–31278 (2024). https://doi.org/10.1007/s11042-023-16592-3