Abstract
Recent studies have applied different approaches to summarizing software artifacts, yet very few efforts have been made to summarize the source code fragments available on the web. This paper investigates the feasibility of generating code fragment summaries using supervised learning algorithms. We hire a crowd of ten individuals from the same workplace to extract source code features on a corpus of 127 code fragments retrieved from the official Eclipse and NetBeans frequently asked questions (FAQs). Human annotators suggest summary lines. Our machine learning algorithms achieve a precision of 82% and perform statistically better than existing code fragment classifiers. Evaluation of the algorithms on several statistical measures endorses our result. This result is promising, as employing mechanisms such as data-driven crowd enlistment improves the efficacy of existing code fragment classifiers.
Author information
Najam Nazar received his BS degree in Computer Science from the University of the Punjab, Lahore, Pakistan in 2005 and his MS degree in Software Engineering from Chalmers University of Technology, Sweden in 2010. He is currently working towards his PhD degree in Software Engineering at Dalian University of Technology, China. His current research interests include mining software repositories, data mining, natural language processing, machine learning, software product lines, and agile methodologies.
He Jiang received the PhD degree in computer science from the University of Science and Technology of China, China. He is currently a Professor at Dalian University of Technology, China. His current research interests include computational intelligence and its applications in software engineering and data mining. He is also a member of the ACM and the CCF.
Guojun Gao received his Bachelor’s degree in Software Engineering from the School of Software, Dalian University of Technology, China in 2014. He is currently pursuing an MS degree in Software Engineering at the same university. His research interests include defect prediction and detection in software engineering.
Tao Zhang received the BE and ME degrees in Automation and Software Engineering from Northeastern University, China, in 2005 and 2008, respectively. He received the PhD degree in Computer Science from the University of Seoul, South Korea in 2013. He was a research professor at the University of Seoul, South Korea from 2013 to 2014. Currently, he is a postdoctoral fellow at the Hong Kong Polytechnic University, China. His research interests include mining software maintenance, security and privacy for mobile apps, and recommendation systems.
Xiaochen Li received the BS degree in software engineering from the Dalian University of Technology, China in 2015. He is currently a PhD candidate at Dalian University of Technology. His current research interest is mining software repositories in software engineering.
Zhilei Ren received the BS degree in Software Engineering and the PhD degree in computational mathematics from the Dalian University of Technology, China in 2007 and 2013, respectively. He is currently a lecturer at Dalian University of Technology. His current research interests include evolutionary computation and its applications in software engineering. He is a member of the ACM and the CCF.
About this article
Cite this article
Nazar, N., Jiang, H., Gao, G. et al. Source code fragment summarization with small-scale crowdsourcing based features. Front. Comput. Sci. 10, 504–517 (2016). https://doi.org/10.1007/s11704-015-4409-2
DOI: https://doi.org/10.1007/s11704-015-4409-2