ABSTRACT
Researchers have employed a variety of techniques to extract underlying topics that relate to software development artifacts. Typically, these techniques use semi-unsupervised machine-learning algorithms to suggest candidate word-lists. However, word-lists are difficult to interpret in the absence of meaningful summary labels. Current topic modeling techniques assume manual labelling and do not use domain-specific knowledge to improve, contextualize, or describe results for the developers. We propose a solution: automated labelled topic extraction. Topics are extracted using Latent Dirichlet Allocation (LDA) from commit-log comments recovered from source control systems such as CVS and BitKeeper. These topics are given labels from a generalizable cross-project taxonomy, consisting of non-functional requirements. Our approach was evaluated with experiments and case studies on two large-scale RDBMS projects: MySQL and MaxDB. The case studies show that labelled topic extraction can produce appropriate, context-sensitive labels relevant to these projects, which provides fresh insight into their evolving software development activities.
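As a rough illustration of the labelling step described above, the following minimal sketch assigns non-functional-requirement labels to topic word-lists by simple word overlap. It assumes the topic word-lists have already been produced by LDA. The taxonomy entries, the `label_topic` helper, and the `min_hits` threshold are all hypothetical and chosen for illustration; they are not the word-lists or the matching procedure used in the paper.

```python
# Hypothetical NFR word-lists, for illustration only (not the paper's taxonomy).
NFR_TAXONOMY = {
    "portability": {"port", "platform", "windows", "linux", "compile"},
    "efficiency": {"performance", "slow", "fast", "memory", "cache", "optimize"},
    "reliability": {"crash", "fail", "error", "recover", "deadlock"},
    "maintainability": {"refactor", "cleanup", "rename", "comment", "style"},
}


def label_topic(topic_words, taxonomy=NFR_TAXONOMY, min_hits=1):
    """Return the sorted labels whose word-list shares at least
    `min_hits` words with the topic's word-list."""
    words = {w.lower() for w in topic_words}
    labels = [
        label
        for label, vocabulary in taxonomy.items()
        if len(words & vocabulary) >= min_hits
    ]
    return sorted(labels)


if __name__ == "__main__":
    # Assumed output of an LDA run over commit-log comments (made-up words).
    topics = [
        ["fix", "crash", "windows", "build"],
        ["cache", "memory", "optimize", "query"],
        ["merge", "tree", "branch"],
    ]
    for topic in topics:
        labels = label_topic(topic) or ["(unlabelled)"]
        print(", ".join(topic), "->", ", ".join(labels))
```

A topic may receive several labels or none at all. Raising `min_hits` trades coverage for precision, which is one plausible way to keep generic topics unlabelled.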
Index Terms
- Automated topic naming to support cross-project analysis of software maintenance activities