
Weighing lexical information for software clustering in the context of architecture recovery

Empirical Software Engineering

Abstract

In this paper, we present a software clustering approach that leverages the information conveyed by the zone in which each lexeme appears in the classes of object-oriented systems. We define six zones in the source code: Class Name, Attribute Name, Method Name, Parameter Name, Comment, and Source Code Statement. These zones may convey information with different levels of relevance, so their contributions should be weighted differently according to the software system under study. To this end, we define a probabilistic model of the lexeme distribution whose parameters are automatically estimated by the Expectation-Maximization algorithm. The resulting zone weights are then used to compute similarities among source code classes, which are grouped by a k-Medoid clustering algorithm. To assess the validity of our solution for software architecture recovery, we applied our approach to 19 software systems from different application domains. We observed that the use of our probabilistic model and the defined zones improves the quality of the clustering results, bringing them close to a theoretical upper bound that we have proved.
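The last two steps of the pipeline (zone-weighted similarity, then k-medoid grouping) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the zone weights are made-up constants rather than values estimated by Expectation-Maximization, combining per-zone cosine similarities by a weighted sum is an assumed combination rule, and the names `ZONES`, `WEIGHTS`, `zone_similarity`, and `k_medoids` are hypothetical.

```python
import math
from collections import Counter

# The six zones named in the abstract (identifiers are hypothetical).
ZONES = ["class_name", "attribute_name", "method_name",
         "parameter_name", "comment", "statement"]

# Made-up weights for illustration only; the paper estimates them with EM.
WEIGHTS = {"class_name": 0.3, "attribute_name": 0.1, "method_name": 0.3,
           "parameter_name": 0.1, "comment": 0.1, "statement": 0.1}

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def zone_similarity(c1, c2, weights):
    """Weighted sum of per-zone cosine similarities (assumed rule)."""
    return sum(weights[z] * cosine(c1.get(z, Counter()), c2.get(z, Counter()))
               for z in ZONES)

def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix."""
    n = len(dist)
    medoids = list(range(k))  # deterministic start, good enough for a sketch

    def assign(meds):
        # Each item goes to the cluster of its nearest medoid.
        return [min(range(k), key=lambda j: dist[i][meds[j]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j]
            if not members:
                new_medoids.append(medoids[j])  # keep medoid of an empty cluster
                continue
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)

# Toy input: four classes described by lexeme counts in two of the zones.
classes = [
    {"class_name": Counter({"order": 1}),
     "method_name": Counter({"add": 1, "item": 1})},
    {"class_name": Counter({"order": 1, "line": 1}),
     "method_name": Counter({"add": 1})},
    {"class_name": Counter({"parser": 1}),
     "method_name": Counter({"parse": 1, "token": 1})},
    {"class_name": Counter({"lexer": 1}),
     "method_name": Counter({"token": 1, "next": 1})},
]
n = len(classes)
dist = [[0.0 if i == j
         else 1.0 - zone_similarity(classes[i], classes[j], WEIGHTS)
         for j in range(n)] for i in range(n)]
medoids, labels = k_medoids(dist, k=2)
```

On this toy input the two order-related classes end up in one cluster and the two token-handling classes in the other. Changing `WEIGHTS` changes which zones dominate the similarity, which is the effect the zone-weighting idea relies on.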


Notes

  1. The likelihood is equal to the probability of the observed data x given the values of the model parameters \(\theta\), i.e., \(\mathcal{L}(\theta | x) = P(x | \theta)\).

  2. A term is a normalized type that is included in the dictionary. A type is the class of all the tokens containing the same character sequence (Manning et al. 2008).

  3. http://www2.unibas.it/gscanniello/software-clustering/

  4. The complete formulation is reported in Saw et al. (1984).

  5. Given two values a and b, the mean percentage improvement is computed as \(\frac{b-a}{a}\).

  6. Each zone has its own vocabulary, so if the same lexeme appears in two or more zones, we treat its occurrences in different zones as distinct terms.

  7. The Shapiro-Wilk W test returned 0.049 and 0.063 as the p-values for VSM and ZU, respectively. This confirms our postulation on the distribution of data.

  8. The Shapiro-Wilk W test returned p-values of 0.063 for ZU, 0.04 for ZWB, and 0.098 for ZWG. Although the results of the Shapiro-Wilk W test suggest that a parametric test (e.g., an unpaired t-test) could be used to verify the presence of a statistically significant difference between the authoritativeness values obtained with ZWG and ZU, we used the Mann-Whitney test because repeated statistical tests were needed to test Hn2.
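The quantities mentioned in notes 5 and 8 can be illustrated with a small sketch. It assumes nothing beyond the definitions given in the notes: the function names are hypothetical, the sample values are made up and are not results from the study, and only the Mann-Whitney U statistic is computed (no p-value). In practice one would more likely rely on a statistics library, for example SciPy's `scipy.stats.shapiro` and `scipy.stats.mannwhitneyu`.

```python
def relative_improvement(a, b):
    """Improvement of b over baseline a as defined in note 5: (b - a) / a."""
    if a == 0:
        raise ValueError("baseline a must be non-zero")
    return (b - a) / a

def mann_whitney_u(x, y):
    """U statistic of sample x against sample y.

    Counts the pairs (xi, yj) with xi > yj; ties contribute one half.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Made-up authoritativeness values for two hypothetical configurations.
x = [0.62, 0.70, 0.75]
y = [0.50, 0.55, 0.66]
u_x = mann_whitney_u(x, y)
u_y = mann_whitney_u(y, x)
gain = relative_improvement(0.5, 0.6)
```

The two U values always sum to `len(x) * len(y)`, which gives a quick sanity check on the computation.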

References

  • Ali N, Gueheneuc YG, Antoniol G (2011) Requirements traceability for object oriented systems by partitioning source code. In: Proceedings of working conference on reverse engineering. IEEE Computer Society, pp 45–54

  • Andritsos P, Tzerpos V (2005) Information-theoretic software clustering. IEEE Trans Softw Eng 31(2):150–165


  • Anquetil N, Fourrier C, Lethbridge TC (1999) Experiments with clustering as a software remodularization method. In: Proceedings of working conference on reverse engineering. IEEE Computer Society, Washington, pp 235–255


  • Basili VR, Green S, Laitenberger O, Lanubile F, Shull F, Sørumgård LS, Zelkowitz MV (1996) The empirical investigation of perspective-based reading. Empir Softw Eng 1(2):133–164


  • Bavota G, De Lucia A, Marcus A, Oliveto R (2010) Software re-modularization based on structural and semantic metrics. In: Proceedings of international working conference on reverse engineering. IEEE Computer Society, pp 195–204

  • Bavota G, De Lucia A, Marcus A, Oliveto R (2013a) Using structural and semantic measures to improve software modularization. Empir Softw Eng 18(5):901–932


  • Bavota G, Dit B, Oliveto R, Penta MD, Poshyvanyk D, Lucia AD (2013b) An empirical study on the developers’ perception of software coupling. In: Proceedings of international conference on software engineering. IEEE / ACM, pp 692–701

  • Bavota G, Gethers M, Oliveto R, Poshyvanyk D, De Lucia A (2014a) Improving software modularization via automated analysis of latent topics and dependencies. ACM Trans Softw Eng Methodol 23(1): 4:1–4:33. doi:10.1145/2559935


  • Bavota G, Oliveto R, Gethers M, Poshyvanyk D, De Lucia A (2014b) Methodbook: Recommending move method refactorings via relational topic models. IEEE Trans Softw Eng 40(7):671–694


  • Binkley D (2007) Source code analysis: a road map. In: Future of software engineering. IEEE Computer Society, pp 104–119

  • Bishop C (2006) Pattern recognition and machine learning. Information science and statistics. Springer

  • Bittencourt RA, Guerrero DDS (2009) Comparison of graph clustering algorithms for recovering software architecture module views. In: Proceedings of the European conference on software maintenance and reengineering. IEEE Computer Society, pp 251–254

  • Conover WJ (1998) Practical nonparametric statistics, 3rd edn. Wiley

  • Corazza A, Di Martino S, Maggio V, Moschitti A, Passerini A, Scanniello G, Silvestri F (2013) Using machine learning and information retrieval techniques to improve software maintainability. In: Eternal systems, communications in computer and information science. Springer, Berlin. In Press


  • Corazza A, Di Martino S, Maggio V, Scanniello G (2011) Investigating the use of lexical information for software system clustering. In: Proceedings of European conference on software maintenance and reengineering. IEEE Computer Society, pp 35–44

  • Corazza A, Di Martino S, Scanniello G (2010) A probabilistic based approach towards software system clustering. In: Proceedings of European conference on software maintenance and reengineering. IEEE Computer Society, pp 89–98

  • De Lucia A, Di Penta M, Oliveto R (2011) Improving source code lexicon via traceability and information retrieval. IEEE Trans Softw Eng 37(2):205–227


  • De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2012) Using IR methods for labeling source code artifacts: is it worthwhile? In: Proceedings of international conference on program comprehension. IEEE Computer Society Press, pp 193–202

  • De Lucia A, Risi M, Scanniello G, Tortora G (2009) An investigation of clustering algorithms in the comprehension of legacy web applications. J Web Eng 8(4):346–370


  • Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407


  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc Ser B 39(1):1–38


  • van Deursen A, Hofmeister C, Koschke R, Moonen L, Riva C (2004) Symphony: view-driven software architecture reconstruction. In: Proceedings of working conference on software architecture, pp 122–134

  • Ducasse S, Pollet D (2009) Software architecture reconstruction: a process-oriented taxonomy. IEEE Trans Softw Eng 35(4):573–591. doi:10.1109/TSE.2009.19


  • Eastwood A (1993) Firm fires shots at legacy systems. Comput Canada 19(2):17


  • Erlikh L (2000) Leveraging legacy system dollars for e-business. IT Professional 2:17–23


  • Flach P (2012) Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press

  • Freund RJ, Wilson WJ (2003) Statistical methods, 2nd edn. Academic Press

  • Grubb P, Takang AA (2003) Software maintenance: concepts and practice, 2nd edn. World Scientific

  • Jarzabek S (2007) Effective software maintenance and evolution—a reuse-based approach. Auerbach Publ

  • Kampenes V, Dyba T, Hannay J, Sjoberg I (2006) A systematic review of effect size in software engineering experiments. Inf Softw Technol 49(11–12):1073–1086


  • Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis. Wiley-Interscience

  • Freedman KB, Bernstein J (1999) Current concepts review - sample size and statistical power in clinical orthopaedic research. J Bone Joint Surg 81:1454–1460


  • Kitchenham B, Al-Khilidar H, Babar M, Berry M, Cox K, Keung J, Kurniawati F, Staples M, Zhang H, Zhu L (2008) Evaluating guidelines for reporting empirical software engineering studies. Empir Softw Eng 13(1):97–121


  • Koschke R (2000) Atomic architectural component recovery for program understanding and evolution. Ph.D. thesis, University of Stuttgart

  • Kuhn A, Ducasse S, Girba T (2005) Enriching reverse engineering with semantic clustering. In: Proceedings of international working conference on reverse engineering. IEEE Computer Society, pp 133–142

  • Kuhn A, Ducasse S, Gîrba T (2007) Semantic clustering: Identifying topics in source code. Inf Softw Technol 49(3):230–243


  • Liu Y, Poshyvanyk D, Ferenc R, Gyimóthy T, Chrisochoides N (2009) Modeling class cohesion as mixtures of latent topics. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 233–242

  • Mahdavi K (2005) A clustering genetic algorithm for software modularisation with a multiple hill climbing approach. Ph.D. thesis, Department of Information Systems and Computing, Brunel University


  • Maletic JI, Marcus A (2001) Supporting program comprehension using semantic and structural information. In: Proceedings of international conference on software engineering. IEEE Computer Society, Washington, pp 103–112


  • Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New York


  • Maqbool O, Babri H (2007) Hierarchical clustering for software architecture recovery. IEEE Trans Softw Eng 33(11):759–780


  • Marcus A, Poshyvanyk D (2005) The conceptual cohesion of classes. In: International conference on software maintenance. IEEE Computer Society, pp 133–142

  • Mashiko Y, Basili V (1997) Using the GQM paradigm to investigate influential factors for software process improvement. J Syst Softw 36(1):17–32


  • McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: Proceedings of workshop on learning for text categorization. AAAI Press, pp 41–48

  • McLachlan GJ, Krishnan T (1996) The EM algorithm and extensions. Wiley-Interscience

  • Mendonça NC, Kramer J (1996) Requirements for an effective architecture recovery framework. In: Joint proceedings of the second international software architecture workshop and international workshop on multiple perspectives in software development. ACM, pp 101–105

  • Mitchell TM (1997) Machine learning, 1st edn. McGraw-Hill, Inc., New York


  • Pfleeger SL, Menezes W (2000) Marketing technology to software practitioners. IEEE Softw 17:27–33


  • Port O (1998) The software trap – automate or else. Bus Week 9(3051):142–154


  • Porter MF (1997) An algorithm for suffix stripping. Morgan Kaufmann Publishers Inc., San Francisco, pp 313–316


  • Poshyvanyk D, Marcus A (2006) The conceptual coupling metrics for object-oriented systems. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 469–478

  • Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical recipes in C, the art of scientific computing, 2nd edn. Cambridge University Press

  • Reggio G, Ricca F, Scanniello G, Di Cerbo F, Dodero G (2011) A precise style for business process modelling: Results from two controlled experiments. In: Proceedings of model driven engineering languages and systems, lecture notes in computer science. Springer, pp 138–152

  • Revelle M, Gethers M, Poshyvanyk D (2011) Using structural and textual information to capture feature coupling in object-oriented software. Empir Softw Eng 16(6):773–811


  • Risi M, Scanniello G, Tortora G (2012) Using fold-in and fold-out in the architecture recovery of software systems. Formal Asp Comput 24(3):307–330


  • Romano S, Scanniello G, Risi M, Gravino C (2011) Clustering and lexical information support for the recovery of design pattern in source code. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 500–503

  • Romesburg H (2004) Cluster analysis for researchers. Lulu Press. http://books.google.it/books?id=ZuIPv7OKm10C

  • Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620. doi:10.1145/361219.361220


  • Saw JG, Yang MCK, Mo TC (1984) Chebyshev inequality with estimated mean and variance. Am Stat 38(2):130–132


  • Scanniello G, D’Amico A, D’Amico C, D’Amico T (2010) Using the Kleinberg algorithm and Vector Space Model for software system clustering. In: Proceedings of international conference on program comprehension. IEEE Computer Society, pp 180–189

  • Scanniello G, Gravino C, Marcus A, Menzies T (2013) Class level fault prediction using software clustering. In: Proceedings of international conference on automated software engineering. IEEE / ACM, pp 640–645

  • Scanniello G, Marcus A (2011) Clustering support for static concept location in source code. In: Proceedings of international conference on program comprehension. IEEE Computer Society, pp 1–10

  • Scanniello G, Marcus A, Pascale D (2014) Link analysis algorithms for static concept location: an empirical assessment. Empir Softw Eng 1–55. doi:10.1007/s10664-014-9327-7

  • Scanniello G, Risi M, Tortora G (2010) Architecture recovery using latent semantic indexing and k-means: an empirical evaluation. In: Proceedings of international conference on software engineering and formal methods. IEEE Computer Society, pp 103–112

  • Shapiro S, Wilk M (1965) An analysis of variance test for normality. Biometrika 52(3–4):591–611


  • Shtern M, Tzerpos V (2011) Evaluating software clustering using multiple simulated authoritative decompositions. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 353–361

  • Tonella P (2001) Concept analysis for module restructuring. IEEE Trans Softw Eng 27(4):351–363. doi:10.1109/32.917524


  • Tzerpos V, Holt RC (1999) MoJo: A distance metric for software clusterings. In: Proceedings of the working conference on reverse engineering, pp 187–193

  • Wen Z, Tzerpos V (2004) An effectiveness measure for software clustering algorithms. In: Proceedings of international conference on program comprehension. IEEE Computer Society, pp 194–203

  • Wiggerts TA (1997) Using clustering algorithms in legacy systems remodularization. In: Proceedings of working conference on reverse engineering. IEEE Computer Society, Washington, pp 33–43


  • Wohlin C, Runeson P, Höst M, Ohlsson M, Regnell B, Wesslén A (2000) Experimentation in software engineering - an introduction. Kluwer

  • Wu J, Hassan AE, Holt RC (2005) Comparison of clustering algorithms in the context of software evolution. In: Proceedings of international conference on software maintenance. IEEE Computer Society, pp 525–535


Author information


Corresponding author

Correspondence to Valerio Maggio.

Additional information

Communicated by: Thomas Zimmermann


About this article


Cite this article

Corazza, A., Di Martino, S., Maggio, V. et al. Weighing lexical information for software clustering in the context of architecture recovery. Empir Software Eng 21, 72–103 (2016). https://doi.org/10.1007/s10664-014-9347-3


  • DOI: https://doi.org/10.1007/s10664-014-9347-3
