
GIMM: A graph convolutional network-based paraphrase identification model for detecting duplicate questions in QA communities

Multimedia Tools and Applications

Abstract

Paraphrase Identification (PI) is an important task in Natural Language Processing (NLP) that aims to detect whether two sentences expressed in different forms are semantically consistent. It can be used to detect duplicate questions in QA communities (e.g., Quora and Stack Overflow). Many studies have applied Convolutional Neural Networks to capture rich matching information between sentence pairs layer by layer. However, only a limited number of studies have explored the more flexible Graph Convolutional Networks (GCNs) for this task. A GCN operates directly on a graph and learns the representation of each node from the information in its neighborhood, so the interactive information between two sentences can be effectively integrated based on the local graph structure. In this paper, a Graph-based Interaction Matching model (GIMM) for PI is proposed. GIMM builds an interaction graph by taking each word as a node and using both word co-occurrence relations between the sentence pair and phrase relations within each single sentence as the relations between nodes. A GCN is then applied to learn richer word representations based on the local structure of the graph. Finally, the node representations are aligned by an attention mechanism to obtain the matching vector, and the PI result is produced by a fully connected layer. We conduct experiments comparing GIMM with current baselines on the Quora and Stack Overflow datasets. Experimental results demonstrate that the proposed model achieves excellent performance on both datasets.
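The abstract sketches a four-step pipeline: build an interaction graph over the word nodes of a sentence pair, encode the nodes with a GCN, align the two sides with attention, and classify a matching vector with a fully connected layer. The PyTorch sketch below is one plausible, minimal reading of that pipeline. The graph-construction heuristics (window-based phrase edges, identical-word co-occurrence edges), the layer sizes, the mean pooling, and all names (build_interaction_graph, GCNLayer, GIMMSketch) are our illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a GIMM-style pipeline, assuming PyTorch.
# Graph heuristics, dimensions, and pooling are illustrative guesses.
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_interaction_graph(tokens_a, tokens_b, window=2):
    """Normalized adjacency over the concatenated word nodes of a pair:
    intra-sentence edges link words within a sliding window (a stand-in
    for the paper's phrase relations); cross-sentence edges link
    identical surface words (a stand-in for co-occurrence relations)."""
    n_a, n = len(tokens_a), len(tokens_a) + len(tokens_b)
    adj = torch.eye(n)  # self-loops
    for start, sent in ((0, tokens_a), (n_a, tokens_b)):
        for i in range(len(sent)):
            for j in range(i + 1, min(i + window + 1, len(sent))):
                adj[start + i, start + j] = adj[start + j, start + i] = 1.0
    for i, wa in enumerate(tokens_a):
        for j, wb in enumerate(tokens_b):
            if wa == wb:
                adj[i, n_a + j] = adj[n_a + j, i] = 1.0
    d = adj.sum(dim=1).pow(-0.5)  # symmetric normalization D^-1/2 A D^-1/2
    return adj * d.unsqueeze(0) * d.unsqueeze(1)


class GCNLayer(nn.Module):
    """One graph-convolution step in the style of Kipf & Welling (2016)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        return F.relu(adj @ self.linear(x))  # aggregate neighbors, transform


def attention_align(h_a, h_b):
    """Soft-align the two node sets with dot-product attention."""
    scores = h_a @ h_b.T
    a2b = F.softmax(scores, dim=1) @ h_b   # h_b summarized for each a-node
    b2a = F.softmax(scores.T, dim=1) @ h_a
    return a2b, b2a


class GIMMSketch(nn.Module):
    def __init__(self, vocab_size, dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # could be GloVe-initialized
        self.gcn = GCNLayer(dim)
        self.classifier = nn.Sequential(
            nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, ids_a, ids_b, adj):
        h = self.gcn(self.embed(torch.cat([ids_a, ids_b])), adj)
        h_a, h_b = h[:len(ids_a)], h[len(ids_a):]
        a2b, b2a = attention_align(h_a, h_b)
        # Pool each side with its aligned counterpart into a matching vector.
        v = torch.cat([h_a.mean(0), a2b.mean(0), h_b.mean(0), b2a.mean(0)])
        return self.classifier(v)  # logits: duplicate vs. not duplicate
```

A toy invocation, with a hypothetical vocabulary, might look like:

```python
vocab = {w: i for i, w in enumerate(
    ["how", "do", "i", "sort", "a", "list", "an", "array"])}
q1 = ["how", "do", "i", "sort", "a", "list"]
q2 = ["how", "do", "i", "sort", "an", "array"]
adj = build_interaction_graph(q1, q2)
logits = GIMMSketch(len(vocab))(torch.tensor([vocab[w] for w in q1]),
                                torch.tensor([vocab[w] for w in q2]), adj)
```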


Data Availability

The data associated with our study can be made available upon request. Please contact the corresponding author Xuan Zhang (zhxuan@ynu.edu.cn).

Notes

  1. Our pre-trained word embeddings are openly available at https://nlp.stanford.edu/projects

  2. https://www.nltk.org/

  3. https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs

  4. https://anonymousaaai2019.github.io/


Funding

This work was supported by the Science Foundation of Young and Middle-aged Academic and Technical Leaders of Yunnan under Grant No. 202205AC160040; the Science Foundation of the Yunnan Jinzhi Expert Workstation under Grant No. 202205AF150006; the Major Project of the Yunnan Natural Science Foundation under Grant No. 202302AE09002003; the Science and Technology Project of Yunnan Power Grid Co., Ltd. under Grant No. YNKJXM20222254; the Postgraduate Research and Innovation Foundation of Yunnan University under Grant No. 2021Z112; and the Science Foundation of the "Knowledge-driven intelligent software engineering innovation team".

Author information


Corresponding author

Correspondence to Xuan Zhang.

Ethics declarations

Conflicts of interest/Competing interests

We confirm that this work is original, has not been published elsewhere, and is not currently under consideration for publication elsewhere. None of the authors have any competing interests in the manuscript.

Consent to participate

All authors are aware of this submission and have reviewed and consented to it.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Du, K., Zhang, X., Gao, C. et al. GIMM: A graph convolutional network-based paraphrase identification model to detecting duplicate questions in QA communities. Multimed Tools Appl 83, 31251–31278 (2024). https://doi.org/10.1007/s11042-023-16592-3

