ABSTRACT
When trained on API documentation and tutorials, Word2vec produces vector representations that can be used to estimate the relevance between texts and API elements. However, existing Word2vec-based approaches to measuring document similarity build the representation of a document by aggregating the Word2vec vectors of its individual words or API elements, as if the words were independent. As a result, the semantics of API descriptions and code fragments are not well represented.
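To make this limitation concrete, the following is a minimal sketch of the aggregation baseline, assuming a tiny hand-written embedding table. The vectors and the `average_embedding` helper are hypothetical and are not taken from any trained model or from the paper. Because averaging ignores word order, two phrases built from the same words in a different order receive the same document vector, up to floating-point rounding.

```python
import math

# Hypothetical toy embeddings; real Word2vec vectors would be learned from a corpus.
TOY_VECTORS = {
    "read": [0.9, 0.1, 0.0],
    "file": [0.2, 0.8, 0.1],
    "then": [0.1, 0.1, 0.1],
    "close": [0.0, 0.3, 0.9],
}


def average_embedding(tokens, vectors=TOY_VECTORS):
    """Bag-of-words document vector: element-wise mean of the known token vectors."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens in document")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same words, opposite order: the averaged representations coincide.
doc_a = average_embedding(["read", "file", "then", "close"])
doc_b = average_embedding(["close", "file", "then", "read"])
similarity = cosine(doc_a, doc_b)
```

Here `similarity` is 1.0 up to rounding, even though "read file then close" and "close file then read" describe different usage orders. This is the kind of information an order-aware model is meant to retain.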
In this work, we introduce D2Vec, a new model that fits API documentation better than Word2vec. D2Vec is a neural network model that considers two complementary contexts to better capture the semantics of API documentation. First, it connects the global context of the API topic currently being described to all the text phrases within the description of that API. Second, it preserves the local order of words and API elements in the text phrases when computing the vector representations for the APIs. We conducted an experiment to verify two intrinsic properties of D2Vec's vectors: 1) similar words and relevant API elements are projected into nearby locations; and 2) some vector operations carry semantics. We demonstrate the usefulness and good performance of D2Vec in three applications: API code search (text-to-code retrieval), API tutorial fragment search (code-to-text retrieval), and mining API mappings between software libraries (code-to-code retrieval). Finally, we provide actionable insights and implications for researchers who wish to use our model in other applications with other types of documents.
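All three applications can be read as nearest-neighbor retrieval in a shared vector space. The sketch below illustrates that idea only. It assumes hand-written three-dimensional vectors and a hypothetical `rank_by_cosine` helper, and it does not reproduce D2Vec's training or the paper's actual retrieval pipeline. The API names are used purely as labels.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, candidates):
    """Return candidate names ordered from most to least similar to query_vec."""
    return sorted(
        candidates,
        key=lambda name: cosine(query_vec, candidates[name]),
        reverse=True,
    )


# Hypothetical vectors for illustration; a trained model would supply these.
API_VECTORS = {
    "FileReader.read": [0.9, 0.2, 0.1],
    "FileWriter.write": [0.1, 0.9, 0.2],
    "Socket.connect": [0.1, 0.1, 0.9],
}

# Stands in for the vector of a text query (text-to-code retrieval).
query = [0.8, 0.3, 0.0]
ranking = rank_by_cosine(query, API_VECTORS)
```

The same ranking function would serve code-to-text or code-to-code retrieval by swapping what the query and candidate vectors represent. That reuse is only possible when similar words and relevant API elements lie near each other in the space, which is the first intrinsic property listed above.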
Index Terms
Complementing global and local contexts in representing API descriptions to improve API retrieval tasks