research-article

Automated topic naming to support cross-project analysis of software maintenance activities

Authors:
Abram Hindle (University of California, Davis, Davis, USA)
Neil A. Ernst (University of Toronto, Toronto, Canada)
Michael W. Godfrey (University of Waterloo, Waterloo, Canada)
John Mylopoulos (University of Trento, Trento, Italy)

MSR '11: Proceedings of the 8th Working Conference on Mining Software Repositories, May 2011, Pages 163–172
https://doi.org/10.1145/1985441.1985466

Published: 21 May 2011

ABSTRACT

Researchers have employed a variety of techniques to extract underlying topics that relate to software development artifacts. Typically, these techniques use semi-unsupervised machine-learning algorithms to suggest candidate word-lists. However, word-lists are difficult to interpret in the absence of meaningful summary labels. Current topic modeling techniques assume manual labelling and do not use domain-specific knowledge to improve, contextualize, or describe results for the developers. We propose a solution: automated labelled topic extraction. Topics are extracted using Latent Dirichlet Allocation (LDA) from commit-log comments recovered from source control systems such as CVS and BitKeeper. These topics are given labels from a generalizable cross-project taxonomy consisting of non-functional requirements. Our approach was evaluated with experiments and case studies on two large-scale RDBMS projects: MySQL and MaxDB. The case studies show that labelled topic extraction can produce appropriate, context-sensitive labels relevant to these projects, providing fresh insight into their evolving software development activities.
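
The labelling step described in the abstract can be illustrated with a minimal sketch. It assumes the topic word-lists have already been produced (for example by LDA) and assigns each topic every non-functional-requirement label whose indicator words overlap with the topic's words. The taxonomy entries, the sample topics, the `label_topic` function, and the `min_hits` threshold are all hypothetical and chosen for illustration; they are not the word-lists or the matching rule used in the paper.

```python
# Hypothetical NFR taxonomy: label -> indicative words.
# Illustrative only; these are NOT the word-lists used in the paper.
NFR_TAXONOMY = {
    "portability": {"port", "platform", "windows", "linux", "compile"},
    "efficiency": {"performance", "speed", "slow", "optimize", "memory"},
    "reliability": {"crash", "failure", "error", "recover", "bug"},
}


def label_topic(topic_words, taxonomy=NFR_TAXONOMY, min_hits=1):
    """Return the sorted labels whose indicator words share at least
    `min_hits` words with the topic's word-list."""
    words = {w.lower() for w in topic_words}
    return sorted(
        label
        for label, indicators in taxonomy.items()
        if len(words & indicators) >= min_hits
    )


# Word-lists as a topic model such as LDA might emit them (made up here).
topics = [
    ["fix", "crash", "windows", "build"],
    ["optimize", "query", "speed", "index"],
    ["merge", "tree", "update"],
]

# One list of labels per topic; a topic may get several labels or none.
labels = [label_topic(t) for t in topics]

for topic, topic_labels in zip(topics, labels):
    print(topic, "->", topic_labels or ["(unlabelled)"])
```

A topic that matches no indicator words stays unlabelled in this sketch, and raising `min_hits` trades fewer spurious labels for more unlabelled topics.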


Index Terms

Automated topic naming to support cross-project analysis of software maintenance activities
1. Social and professional topics
  1. Professional topics
    1. Management of computing and information systems
      1. Project and people management
2. Software and its engineering
  1. Software creation and management
    1. Designing software
      1. Requirements analysis
    2. Software development process management
  2. Software notations and tools
    1. Software configuration management and version control systems

Published in
MSR '11: Proceedings of the 8th Working Conference on Mining Software Repositories
May 2011
260 pages
ISBN: 9781450305747
DOI: 10.1145/1985441
General Chair: Arie van Deursen (Delft University of Technology, The Netherlands)
Program Chairs: Tao Xie (North Carolina State University, USA), Thomas Zimmermann (Microsoft Research, USA)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
lda
non-functional requirements
topic analysis