Abstract
The design and maintenance of APIs (Application Programming Interfaces) are complex tasks due to the constantly changing requirements of their users. Despite the efforts of their designers, APIs may suffer from a number of issues, such as incomplete or erroneous documentation, poor performance, and backward incompatibility. To maintain a healthy client base, API designers must learn about these issues in order to fix them. Question answering sites, such as Stack Overflow (SO), have become a popular venue for discussing API issues. Posts about API issues are invaluable to API designers, not only because they help designers learn more about the problems but also because they reveal the requirements of API users. However, the unstructured nature of posts and the abundance of non-issue posts make detecting SO posts concerning API issues challenging. In this paper, we first develop a supervised learning approach that uses a Conditional Random Field (CRF), a statistical modeling method, to identify API issue-related sentences. We combine this information with features collected from posts, the experience of users, readability metrics, and centrality measures of the collaboration network to build a technique, called CAPS, that classifies SO posts concerning API issues. In total, we consider 34 features along eight different dimensions. Evaluation of CAPS using carefully curated SO posts on three popular API types reveals that the technique outperforms all three baseline approaches we consider in this study. We then conduct studies to identify important features and evaluate the performance of the CRF-based technique for classifying issue sentences. Comparison with two other baseline approaches shows that the technique has high potential. We also test the generalizability of the CAPS results, evaluate the effectiveness of different classifiers, and measure the impact of different feature sets.
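The two-stage pipeline the abstract describes — sentence-level issue tagging followed by post-level classification over features drawn from several dimensions — can be sketched as follows. This is a toy illustration, not the actual CAPS implementation: the feature names (`avg_sentence_length`, `user_reputation`, `issue_sentence_ratio`), the weights, and the simple linear scorer are hypothetical stand-ins for the paper's CRF tagger and 34-feature supervised classifier.

```python
import re

def extract_features(post):
    """Compute a few illustrative post-level features in the spirit of CAPS.

    Hypothetical subset only: the paper uses 34 features along eight
    dimensions (e.g., textual content, user experience, readability, and
    collaboration-network centrality), plus the output of a CRF that tags
    issue-related sentences.
    """
    words = re.findall(r"[A-Za-z']+", post["body"])
    sentences = [s for s in re.split(r"[.!?]+", post["body"]) if s.strip()]
    return {
        "avg_sentence_length": len(words) / max(len(sentences), 1),  # readability-style
        "has_code_block": float("<code>" in post["body"]),           # textual
        "user_reputation": float(post.get("reputation", 0)),         # experience
        "issue_sentence_ratio": post.get("issue_ratio", 0.0),        # sentence-stage output
    }

def classify(features, weights, bias=-1.0):
    """Linear scorer standing in for the supervised post-level classifier."""
    score = bias + sum(weights[k] * features[k] for k in weights)
    return "api-issue" if score > 0 else "other"

post = {
    "body": "This API crashes on Android. <code>foo()</code> raises an error.",
    "reputation": 150,
    "issue_ratio": 0.5,  # fraction of sentences the (hypothetical) tagger marked as issue-related
}
weights = {"has_code_block": 0.5, "issue_sentence_ratio": 2.0}
print(classify(extract_features(post), weights))  # prints "api-issue"
```

The key design point carried over from the paper is the staging: the sentence-level model supplies a feature (here `issue_sentence_ratio`) that the post-level classifier consumes alongside features from the other dimensions.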
Additional information
Communicated by: Massimiliano Di Penta and David Shepherd
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Software Analysis, Evolution and Reengineering (SANER)
Cite this article
Ahasanuzzaman, M., Asaduzzaman, M., Roy, C.K. et al. CAPS: a supervised technique for classifying Stack Overflow posts concerning API issues. Empir Software Eng 25, 1493–1532 (2020). https://doi.org/10.1007/s10664-019-09743-4