Science of Computer Programming

Volume 80, Part B, 1 February 2014, Pages 457-479

Studying software evolution using topic models

https://doi.org/10.1016/j.scico.2012.08.003
Under an Elsevier user license
open archive

Abstract

Topic models are generative probabilistic models that have been applied in information retrieval to automatically organize and provide structure to a text corpus. They discover topics in the corpus, each of which represents a real-world concept as a set of frequently co-occurring words. Recently, researchers have found topics to be effective tools for structuring various software artifacts, such as source code, requirements documents, and bug reports. This research also hypothesized that using topics to describe the evolution of software repositories could be useful for maintenance and understanding tasks. However, research has yet to determine whether these automatically discovered topic evolutions describe the evolution of source code in a way that is relevant or meaningful to project stakeholders, so it is not clear whether topic models are a suitable tool for this task.

In this paper, we take a first step towards evaluating topic models in the analysis of software evolution by performing a detailed manual analysis on the source code histories of two well-known and well-documented systems, JHotDraw and jEdit. We define and compute various metrics on the discovered topic evolutions and manually investigate how and why the metrics evolve over time. We find that the large majority (87%–89%) of topic evolutions correspond well with actual code change activities by developers. We are thus encouraged to use topic models as tools for studying the evolution of a software system.

Highlights

► We apply an advanced IR technique, called topic models, to source code histories.
► High-level topic evolutions are created that describe the changes to source code.
► We examine whether the topic evolutions are accurate and meaningful to developers.
► After two case studies, we conclude that topic models are mostly accurate and meaningful.

Keywords

Software evolution
Topic model
Latent Dirichlet allocation
Mining software repositories
