Automated topic naming

Supporting cross-project analysis of software maintenance activities

Empirical Software Engineering

Abstract

Software repositories provide a deluge of software artifacts to analyze. Researchers have attempted to summarize, categorize, and relate these artifacts by using semi-unsupervised machine-learning algorithms, such as Latent Dirichlet Allocation (LDA). LDA is used for concept and topic analysis to suggest candidate word-lists or topics that describe and relate software artifacts. However, these word-lists and topics are difficult to interpret in the absence of meaningful summary labels. Current attempts to interpret topics assume manual labelling and do not use domain-specific knowledge to improve, contextualize, or describe results for the developers. We propose a solution: automated labelled topic extraction. Topics are extracted using LDA from commit-log comments recovered from source control systems. These topics are given labels from a generalizable cross-project taxonomy, consisting of non-functional requirements. Our approach was evaluated with experiments and case studies on three large-scale Relational Database Management System (RDBMS) projects: MySQL, PostgreSQL and MaxDB. The case studies show that labelled topic extraction can produce appropriate, context-sensitive labels that are relevant to these projects, and provide fresh insight into their evolving software development activities.
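The labelling step described above can be illustrated with a minimal sketch. It assumes that a topic receives a non-functional-requirement label whenever its word list shares at least one word with that label's word list. The word lists, topic contents, and the names `NFR_WORDS` and `label_topic` are hypothetical and chosen for illustration; they are not the authors' lists or implementation, and the topics stand in for LDA output rather than being computed here.

```python
from typing import Dict, List, Set

# Hypothetical word lists for three non-functional requirements.
# These are illustrative assumptions, not the word lists used in the article.
NFR_WORDS: Dict[str, Set[str]] = {
    "efficiency": {"performance", "slow", "fast", "memory", "optimize"},
    "portability": {"windows", "linux", "platform", "port", "compile"},
    "reliability": {"crash", "bug", "failure", "error", "recover"},
}


def label_topic(topic_words: List[str],
                nfr_words: Dict[str, Set[str]] = NFR_WORDS) -> List[str]:
    """Return, in sorted order, every label whose word list overlaps the topic."""
    words = {w.lower() for w in topic_words}
    return sorted(label for label, vocab in nfr_words.items() if words & vocab)


# Stand-ins for topics that LDA might extract from commit-log comments.
topics: Dict[str, List[str]] = {
    "topic_0": ["fix", "crash", "windows", "build"],
    "topic_1": ["optimize", "index", "scan", "memory"],
    "topic_2": ["update", "docs", "typo"],
}

# A topic may receive several labels, or none at all.
labels = {name: label_topic(words) for name, words in topics.items()}

for name, assigned in labels.items():
    print(name, assigned if assigned else "(unlabelled)")
```

In this sketch a topic can carry more than one label, and a topic that matches no word list stays unlabelled, which keeps unrelated topics from being forced into the taxonomy.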

Notes

  1. Generated using David A. Wheeler’s SLOCCount, http://dwheeler.com/sloccount.

  2. http://www.sdn.sap.com/irj/sdn/maxdb

  3. http://www.postgresql.org/docs/7.3/static/

  4. NLTK: http://www.nltk.org/.

  5. For our word lists visit http://softwareprocess.es/nomen/.

  6. Since the MySQL and MaxDB data had poor records for developer ids, we focused on PostgreSQL.

Author information

Correspondence to Abram Hindle.

Additional information

Editors: Tao Xie, Thomas Zimmermann and Arie van Deursen

About this article

Cite this article

Hindle, A., Ernst, N.A., Godfrey, M.W. et al. Automated topic naming. Empir Software Eng 18, 1125–1155 (2013). https://doi.org/10.1007/s10664-012-9209-9

  • DOI: https://doi.org/10.1007/s10664-012-9209-9