Duplicate Detection in Programming Question Answering Communities

Published: 17 April 2018

Abstract

Community-based Question Answering (CQA) websites have attracted increasing numbers of users and contributors in recent years. However, duplicate questions occur frequently on CQA websites and are currently identified manually by moderators. Automatic duplicate detection, on one hand, reduces the manual effort moderators spend before closing a question and, on the other hand, helps question issuers find answers quickly. A number of studies have looked into related problems, but very few works target duplicate detection in Programming CQA (PCQA), a branch of CQA dedicated to programmers. Existing works frame the task as a supervised learning problem on question pairs and rely only on textual features. Moreover, the issue of selecting candidate duplicates from large volumes of historical questions is often left unaddressed. To tackle these issues, we model duplicate detection as a two-stage “ranking-classification” problem over question pairs. In the first stage, we rank the historical questions according to their similarity to the newly issued question and select the top-ranked ones as candidates to reduce the search space. In the second stage, we develop novel features that capture both textual similarity and latent semantics on question pairs, leveraging techniques from the deep learning and information retrieval literature. Experiments on real-world questions about multiple programming languages demonstrate that our method works very well, in some cases achieving up to 25% improvement over state-of-the-art benchmarks.
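
The two-stage idea can be illustrated with a minimal sketch. It is not the paper's implementation: the TF-IDF cosine ranking, the Jaccard overlap feature, the equal-weight score, and the fixed threshold are all assumptions chosen for brevity, and the function names (`rank_candidates`, `detect_duplicates`) are hypothetical. In the paper, the second stage uses learned features and a trained classifier rather than a hand-set rule.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer: lowercase alphanumeric runs (keeps + and # for "c++", "c#").
    return re.findall(r"[a-z0-9+#]+", text.lower())


def tfidf_vectors(token_lists):
    # Sparse TF-IDF vectors stored as dicts, using a smoothed IDF.
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in token_lists]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(new_question, history, k=2):
    # Stage 1 (ranking): score every historical question against the new one
    # and keep only the top k as candidates, which shrinks the search space.
    vectors = tfidf_vectors([tokenize(q) for q in history] + [tokenize(new_question)])
    query = vectors[-1]
    scored = [(cosine(query, vec), idx) for idx, vec in enumerate(vectors[:-1])]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, score) for score, idx in scored[:k]]


def jaccard(q1, q2):
    a, b = set(tokenize(q1)), set(tokenize(q2))
    return len(a & b) / len(a | b) if (a | b) else 0.0


def detect_duplicates(new_question, history, k=2, threshold=0.6):
    # Stage 2 (classification): decide duplicate / non-duplicate per candidate pair.
    # ASSUMPTION: an equal-weight score with a fixed threshold stands in for a
    # trained classifier over richer pair features.
    duplicates = []
    for idx, cos_score in rank_candidates(new_question, history, k):
        score = 0.5 * cos_score + 0.5 * jaccard(new_question, history[idx])
        if score >= threshold:
            duplicates.append(idx)
    return duplicates
```

Calling `detect_duplicates(new_question, history)` returns the indices of historical questions flagged as duplicates. Only the `k` candidates that survive the ranking stage are ever passed to the pairwise decision, which is the point of separating the two stages.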

Published In

ACM Transactions on Internet Technology, Volume 18, Issue 3
Special Issue on Artificial Intelligence for Security and Privacy and Regular Papers
August 2018
314 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/3185332
Editor: Munindar P. Singh
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 April 2018
Accepted: 01 November 2017
Revised: 01 October 2017
Received: 01 June 2017
Published in TOIT Volume 18, Issue 3

Author Tags

  1. Community-based question answering
  2. association rules
  3. classification
  4. latent semantics
  5. question quality

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Australian Research Council (ARC) Future Fellowship

Article Metrics

  • Downloads (last 12 months): 15
  • Downloads (last 6 weeks): 0

Reflects downloads up to 01 Mar 2025

Cited By

  • (2024) Refining GPT-3 Embeddings with a Siamese Structure for Technical Post Duplicate Detection. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 114-125. DOI: 10.1109/SANER60148.2024.00019. Online publication date: 12-Mar-2024
  • (2023) Looking for related posts on GitHub discussions. PeerJ Computer Science 9, e1567. DOI: 10.7717/peerj-cs.1567. Online publication date: 9-Nov-2023
  • (2022) Exploring the Feasibility of Transformer Based Models on Question Relatedness. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 831-838. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00136. Online publication date: Dec-2022
  • (2022) Analysis of community question‐answering issues via machine learning and deep learning: State‐of‐the‐art review. CAAI Transactions on Intelligence Technology 8:1, 95-117. DOI: 10.1049/cit2.12081. Online publication date: 4-May-2022
  • (2022) A deep learning based end-to-end system (F-Gen) for automated email FAQ generation. Expert Systems with Applications 187, 115896. DOI: 10.1016/j.eswa.2021.115896. Online publication date: Jan-2022
  • (2022) Augmenting Textbooks with cQA Question-Answers and Annotated YouTube Videos to Increase Its Relevance. Neural Processing Letters 55:1, 551-588. DOI: 10.1007/s11063-022-10897-4. Online publication date: 30-Jun-2022
  • (2021) Copy Adaptive Routing Algorithm Based on Network Connectivity in Flying Ad Hoc Networks. Mobile Information Systems 2021. DOI: 10.1155/2021/8580795. Online publication date: 4-Oct-2021
  • (2021) Attention-based model for predicting question relatedness on Stack Overflow. 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 97-107. DOI: 10.1109/MSR52588.2021.00023. Online publication date: May-2021
  • (2021) Contextual-Semantic-Aware Linkable Knowledge Prediction in Stack Overflow via Self-Attention. 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE), 115-126. DOI: 10.1109/ISSRE52982.2021.00024. Online publication date: Oct-2021
  • (2021) HDQGF: Heterogeneous Data Quality Guarantee Framework Based on Deep Learning. 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), 901-906. DOI: 10.1109/CSCWD49262.2021.9437684. Online publication date: 5-May-2021
