
CAPS: a supervised technique for classifying Stack Overflow posts concerning API issues

Published in: Empirical Software Engineering

Abstract

The design and maintenance of APIs (Application Programming Interfaces) are complex tasks due to the constantly changing requirements of their users. Despite the efforts of their designers, APIs may suffer from a number of issues, such as incomplete or erroneous documentation, poor performance, and backward incompatibility. To maintain a healthy client base, API designers must learn of these issues in order to fix them. Question answering sites, such as Stack Overflow (SO), have become a popular place for discussing API issues. Posts about API issues are invaluable to API designers, not only because they help designers learn more about the problems but also because they reveal the requirements of API users. However, the unstructured nature of posts and the abundance of non-issue posts make detecting SO posts concerning API issues difficult. In this paper, we first develop a supervised learning approach using a Conditional Random Field (CRF), a statistical modeling method, to identify API issue-related sentences. We combine this information with features collected from posts, the experience of users, readability metrics, and centrality measures of the collaboration network to build a technique, called CAPS, that can classify SO posts concerning API issues. In total, we consider 34 features along eight different dimensions. Evaluation of CAPS using carefully curated SO posts on three popular API types reveals that the technique outperforms all three baseline approaches we consider in this study. We then conduct studies to identify the most important features and to evaluate the performance of the CRF-based technique for classifying issue sentences. Comparison with two other baseline approaches shows that the technique has high potential. We also test the generalizability of CAPS results, evaluate the effectiveness of different classifiers, and identify the impact of different feature sets.
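The paper's feature-extraction code is not reproduced here; as one concrete illustration of the readability dimension the abstract mentions, the sketch below computes the standard Flesch Reading Ease score of a post's text. The vowel-group syllable heuristic is an assumption for the sketch, not CAPS's actual implementation.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of vowels; a production system would
    # use a pronunciation dictionary instead.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    # Flesch Reading Ease = 206.835 - 1.015*(words/sentences)
    #                               - 84.6*(syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# Higher scores indicate easier-to-read text; such a score could serve
# as one numeric feature in a post classifier.
score = flesch_reading_ease("The API call fails. The docs are unclear.")
```

In CAPS this would be just one of 34 features; the other dimensions (user experience, network centrality, CRF-derived sentence labels) would be computed analogously and fed to a supervised classifier.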




Author information

Corresponding author

Correspondence to Md Ahasanuzzaman.

Additional information

Communicated by: Massimiliano Di Penta and David Shepherd


This article belongs to the Topical Collection: Software Analysis, Evolution and Reengineering (SANER)


About this article


Cite this article

Ahasanuzzaman, M., Asaduzzaman, M., Roy, C.K. et al. CAPS: a supervised technique for classifying Stack Overflow posts concerning API issues. Empir Software Eng 25, 1493–1532 (2020). https://doi.org/10.1007/s10664-019-09743-4

