Duplicate Detection in Programming Question Answering Communities

Authors:

Wei Emma Zhang,

Wenjie Ruan

ACM Transactions on Internet Technology (TOIT), Volume 18, Issue 3

Article No.: 37, Pages 1 - 21

https://doi.org/10.1145/3169795

Published: 17 April 2018

Abstract

Community-based Question Answering (CQA) websites have attracted increasing numbers of users and contributors in recent years. However, duplicate questions occur frequently on CQA websites and are currently identified manually by moderators. Automatic duplicate detection, on the one hand, reduces the laborious effort moderators spend before closing a question and, on the other hand, helps question issuers find answers quickly. A number of studies have looked into related problems, but very few target duplicate detection in Programming CQA (PCQA), a branch of CQA dedicated to programmers. Existing work frames the task as a supervised learning problem over question pairs and relies only on textual features. Moreover, the issue of selecting candidate duplicates from large volumes of historical questions is often left unaddressed. To tackle these issues, we model duplicate detection as a two-stage “ranking-classification” problem over question pairs. In the first stage, we rank the historical questions by their similarity to the newly issued question and select the top-ranked ones as candidates, reducing the search space. In the second stage, we develop novel features that capture both textual similarity and latent semantics of question pairs, leveraging techniques from the deep learning and information retrieval literature. Experiments on real-world questions about multiple programming languages demonstrate that our method works very well, in some cases improving on state-of-the-art benchmarks by up to 25%.
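The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words cosine similarity, the fixed threshold standing in for a trained classifier, the function names, and the sample questions are all assumptions chosen only to show how ranking narrows the search space before pairs are classified.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on runs of letters, digits, '+', or '#'."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(new_question, history, k=2):
    """Stage 1: keep the k historical questions most similar to the new one."""
    query = Counter(tokenize(new_question))
    scored = [(cosine(query, Counter(tokenize(q))), q) for q in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def classify_pairs(candidates, threshold=0.5):
    """Stage 2 stand-in: a fixed threshold on the stage-1 score.

    The paper trains a classifier on richer pair features; this rule
    (an assumption) only marks where that classifier sits in the pipeline.
    """
    return [q for score, q in candidates if score >= threshold]


# Hypothetical sample data, for illustration only.
history = [
    "How do I reverse a list in Python?",
    "What is a null pointer exception in Java?",
    "How to sort a dictionary by value in Python?",
]
new_question = "How can I reverse a Python list?"

candidates = rank_candidates(new_question, history, k=2)
duplicates = classify_pairs(candidates, threshold=0.5)
print(duplicates)
```

In this sketch only the top-ranked candidates reach the second stage, so the more expensive pairwise step never sees the full history. A realistic second stage would replace the threshold with a classifier trained on labeled question pairs.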

Index Terms

Duplicate Detection in Programming Question Answering Communities
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Near-duplicate and plagiarism detection
      2. Question answering
  2. Information systems applications
    1. Data mining
      1. Association rules

Published In

ACM Transactions on Internet Technology Volume 18, Issue 3

Special Issue on Artificial Intelligence for Security and Privacy and Regular Papers

August 2018

314 pages

ISSN:1533-5399

EISSN:1557-6051

DOI:10.1145/3185332

Editor:
Munindar P. Singh
Department of Computer Science, North Carolina State University

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2018

Accepted: 01 November 2017

Revised: 01 October 2017

Received: 01 June 2017

Published in TOIT Volume 18, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

Australian Research Council (ARC) Future Fellowship

Bibliometrics

Total Citations: 14
Total Downloads: 408
Downloads (Last 12 months): 15
Downloads (Last 6 weeks): 0

Reflects downloads up to 01 Mar 2025
Cited By

Wu X, Li H, Yoshioka N, Washizaki H, Khomh F. (2024). Refining GPT-3 Embeddings with a Siamese Structure for Technical Post Duplicate Detection. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 114-125. Online publication date: 12-Mar-2024.
https://doi.org/10.1109/SANER60148.2024.00019
Lima M, Steinmacher I, Ford D, Liu E, Vorreuter G, Conte T, Gadelha B. (2023). Looking for related posts on GitHub discussions. PeerJ Computer Science 9, e1567. Online publication date: 9-Nov-2023.
https://doi.org/10.7717/peerj-cs.1567
Shu H, Gao P, Yang Z, Li C, Wu M. (2022). Exploring the Feasibility of Transformer Based Models on Question Relatedness. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 831-838. Online publication date: Dec-2022.
https://doi.org/10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00136
Roy P, Saumya S, Singh J, Banerjee S, Gutub A. (2022). Analysis of community question‐answering issues via machine learning and deep learning: State‐of‐the‐art review. CAAI Transactions on Intelligence Technology 8:1, 95-117. Online publication date: 4-May-2022.
https://doi.org/10.1049/cit2.12081
Jeyaraj S, T. R. (2022). A deep learning based end-to-end system (F-Gen) for automated email FAQ generation. Expert Systems with Applications 187, 115896. Online publication date: Jan-2022.
https://doi.org/10.1016/j.eswa.2021.115896
Kumar S, Chauhan A. (2022). Augmenting Textbooks with cQA Question-Answers and Annotated YouTube Videos to Increase Its Relevance. Neural Processing Letters 55:1, 551-588. Online publication date: 30-Jun-2022.
https://doi.org/10.1007/s11063-022-10897-4
Wu H, Sang Q, Wang Y, Ma H, Xing L. (2021). Copy Adaptive Routing Algorithm Based on Network Connectivity in Flying Ad Hoc Networks. Mobile Information Systems 2021. Online publication date: 4-Oct-2021.
https://dl.acm.org/doi/10.1155/2021/8580795
Pei J, Wu Y, Qin Z, Cong Y, Guan J. (2021). Attention-based model for predicting question relatedness on Stack Overflow. 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 97-107. Online publication date: May-2021.
https://doi.org/10.1109/MSR52588.2021.00023
Luo Z, Xu L, Xu Z, Yan M, Lei Y, Li C. (2021). Contextual-Semantic-Aware Linkable Knowledge Prediction in Stack Overflow via Self-Attention. 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE), 115-126. Online publication date: Oct-2021.
https://doi.org/10.1109/ISSRE52982.2021.00024
Zhang Y, Jin Z, Zhu W, Chi L, Wang W. (2021). HDQGF: Heterogeneous Data Quality Guarantee Framework Based on Deep Learning. 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), 901-906. Online publication date: 5-May-2021.
https://doi.org/10.1109/CSCWD49262.2021.9437684
