Abstract
Developers often use learning resources such as API tutorials and Stack Overflow (SO) to learn how to use an unfamiliar API. An API tutorial can be divided into consecutive units that describe the same topic, called tutorial fragments. We consider a tutorial fragment that explains usage knowledge of an API to be a relevant fragment of that API. Discovering relevant tutorial fragments of APIs can facilitate API understanding, learning, and application. However, existing approaches, whether supervised or unsupervised, suffer from either high manual effort or a failure to consider relevance information. In this paper, we propose a novel approach, called SO2RT, to detect relevant tutorial fragments of APIs based on SO posts. SO2RT first automatically extracts relevant and irrelevant \(\left \langle API, QA \right \rangle \) pairs (QA stands for question and answer) and \(\left \langle API, FRA \right \rangle \) pairs (FRA stands for tutorial fragment). It then trains a detection model based on semi-supervised transfer learning, which transfers the API usage knowledge in SO Q&A pairs to tutorial fragments by utilizing the easy-to-extract \(\left \langle API, QA \right \rangle \) pairs. Finally, relevant fragments of APIs are discovered by consulting the trained model. In this way, the effort of labeling the relevance between tutorial fragments and APIs is reduced. We evaluate SO2RT on Java and Android datasets containing 21,008 \(\left \langle API, QA \right \rangle \) pairs. Experimental results show that SO2RT outperforms the state-of-the-art approaches in terms of F-Measure on both datasets. Our user study further confirms the effectiveness of SO2RT in practice. We also show a successful application of the relevant fragments to API recommendation.
Data Availability
Our tool, experimental data, and results are publicly available at: https://sites.google.com/site/stcaso2rt.
Acknowledgements
The authors would like to thank the anonymous reviewers for their constructive comments and suggestions. This work was supported by the NSFC Project under Grant Nos. 62176069 and 61933013, the Innovation Group of Guangdong Education Department under Grant No. 2020KCXTD014, the 2019 Key Discipline project of Guangdong Province, and the Jiangsu Funding Program for Excellent Postdoctoral Talent under Grant No. 20220ZB43. Hongyu Zhang is supported by the Australian Research Council (ARC) Discovery Project DP220103044.
Additional information
Communicated by: Christoph Treude
About this article
Cite this article
Wu, D., Jing, XY., Zhang, H. et al. Leveraging Stack Overflow to detect relevant tutorial fragments of APIs. Empir Software Eng 28, 12 (2023). https://doi.org/10.1007/s10664-022-10235-1
DOI: https://doi.org/10.1007/s10664-022-10235-1