Abstract
API tutorials are important learning resources as they explain how to use certain APIs in a given programming context. An API tutorial can be split into a number of units. Consecutive units that describe the same topic are often called a tutorial fragment. We consider the API explained by a tutorial fragment as an API tag. Generating API tags for a tutorial fragment can help understand, navigate, and retrieve the fragment. Existing approaches to API tag generation often require high manual effort and achieve low accuracy. Like API tutorials, Stack Overflow (SO) is also an important learning resource that provides explanations of APIs. Thus, SO posts also contain API tags. Besides, API tags of SO posts are abundant and can be extracted easily. In this paper, we propose a novel approach, ATTACK (which stands for API Tag for Tutorial frAgments using Crowd Knowledge), which can automatically generate API tags for tutorial fragments from SO posts. ATTACK first constructs \(\left \langle Q\&A\ pair, tag\ set \right \rangle \) pairs by extracting API tags of SO posts. Then, it trains a deep neural network with the attention mechanism to learn the semantic relatedness between Q&A pairs and the associated API tags, taking into consideration both textual descriptions and code in a Q&A pair. Finally, the trained model is used to generate API tags for tutorial fragments. We evaluate ATTACK on public Java and Android datasets containing 43,132 \(\left \langle Q\&A\ pair, tag\ set \right \rangle \) pairs. Experimental results show that ATTACK is effective and outperforms the state-of-the-art approaches in terms of F-Measure. Our user study further confirms the effectiveness of ATTACK in generating API tags for tutorial fragments. We also apply ATTACK to document linking and the results confirm the usefulness of API tags generated by ATTACK.
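The abstract reports results in terms of F-Measure without defining it here. As a minimal sketch, assuming the standard set-based definition (the harmonic mean of precision and recall over a generated tag set and a ground-truth tag set), the score for a single fragment could be computed as follows. The function name `f_measure` and the example tag sets are hypothetical and are not taken from the paper.

```python
from typing import Set


def f_measure(predicted: Set[str], actual: Set[str]) -> float:
    """Set-based F1: harmonic mean of precision and recall.

    Assumption: this is the conventional definition; the paper's exact
    evaluation protocol is not given in this section.
    """
    if not predicted or not actual:
        return 0.0
    hits = len(predicted & actual)  # tags that are both generated and correct
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: three generated tags, two of which are correct.
predicted_tags = {"DateTime", "DateTimeFormat", "Period"}
actual_tags = {"DateTime", "DateTimeFormat"}
score = f_measure(predicted_tags, actual_tags)
print(f"F-Measure: {score:.2f}")
```

In this example precision is 2/3 and recall is 1, so one spurious tag lowers the score even though every ground-truth tag was found.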
Notes
A positive instance refers to a Q&A pair/fragment and one of its API tags, while a negative instance refers to a Q&A pair/fragment and a supporting API.
An API tag refers to an API explained by a tutorial fragment/Q&A pair, while an SO tag is a question-related keyword such as a programming language (e.g., java, android), a library (e.g., jodatime, graphics), or an API (e.g., DateTime, Canvas).
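To make the distinction between SO tags and API tags concrete, the sketch below builds one ⟨Q&A pair, tag set⟩ record by keeping only the SO tags that match a known API name. This is an illustrative assumption, not the paper's extraction procedure, which is not described in this section. `KNOWN_APIS`, `TaggedPair`, and `extract_api_tags` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

# Hypothetical list of API names, e.g. collected from an API specification.
KNOWN_APIS: FrozenSet[str] = frozenset({"DateTime", "DateTimeFormat", "Canvas"})


@dataclass(frozen=True)
class TaggedPair:
    """One <Q&A pair, tag set> record."""
    qa_text: str
    api_tags: FrozenSet[str]


def extract_api_tags(so_tags: Iterable[str],
                     known_apis: FrozenSet[str] = KNOWN_APIS) -> FrozenSet[str]:
    """Keep the SO tags that name a known API, matched case-insensitively.

    Tags for languages (java) or libraries (jodatime) are dropped because
    they are not in the known-API list.
    """
    lookup = {name.lower(): name for name in known_apis}
    return frozenset(lookup[tag.lower()] for tag in so_tags if tag.lower() in lookup)


# Hypothetical SO question with a language tag, a library tag, and an API tag.
pair = TaggedPair(
    qa_text="How do I parse a date string with Joda-Time?",
    api_tags=extract_api_tags(["java", "jodatime", "datetime"]),
)
print(sorted(pair.api_tags))
```

Only `datetime` survives the filter here, since `java` and `jodatime` name a language and a library rather than an API.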
Acknowledgments
The authors would like to thank the anonymous reviewers for their constructive comments and suggestions. This work was supported by the NSFC-Key Project of General Technology Fundamental Research United Fund under Grant No. U1736211, No. 61933013, and No. 62032016, the Natural Science Foundation of Guangdong Province under Grant No. 2019A1515011076, the Innovation Group of Guangdong Education Department under Grant No. 2020KCXTD014 and No. 2018KCXTD019, and the Key Project of Natural Science Foundation of Hubei Province under Grant No. 2018CFA024.
Additional information
Communicated by: David Lo
About this article
Cite this article
Wu, D., Jing, XY., Zhang, H. et al. Generating API tags for tutorial fragments from Stack Overflow. Empir Software Eng 26, 66 (2021). https://doi.org/10.1007/s10664-021-09962-8
DOI: https://doi.org/10.1007/s10664-021-09962-8